PCLinuxOS Magazine

Command Line Interface Intro:
Part 9

by Peter Kelly (critter)

Almost everything that you do on the command line involves moving, changing, comparing or deleting text data. This data may reside in a file on a disk drive or be generated as the output from a previous command in the form of a data stream.

When the UNIX operating system was first developed at Bell Laboratories in the early 1970's, Ken Thompson, who is generally regarded as the chief architect of the project, was keen to implement a system of inter-connecting streams of data as an alternative to using a series of discrete processes to achieve the required output. Today we refer to this system as 'pipes' and 'redirection'.

The software tools available in those dark days were rather primitive, but have mostly survived and evolved into what we use today. The program ed has survived mostly unchanged in its usage since those times (which probably accounts for its lack of use today). ed is a line editor, unlike text editors such as vi(m). A line editor reads in a file and works on one line at a time, not on the whole file. You make changes to the line and then move to another line.

Just as the program was simple, so were the commands. You used p to print the line to the terminal so that you could actually see what you were editing (this is not done automatically), d to delete the line, and s to substitute some text for some other text, but only in that line. For editing a large text file interactively, by hand, this is far too restrictive, so the text editors that we more commonly use today were developed.

With the introduction of pipes for streams of data, though, this method of editing line by line, non-interactively, is ideal, and so a new tool was introduced, known as a stream editor. This reads in data and applies a series of commands to the data as it flows through. These commands (deletions, substitutions etc.) could be supplied on the command line, or read in from a file or script. If the input data is from a file, then that file is not changed. Only the data in the output stream is affected, and this can be saved as a new file or further processed along the 'pipeline'.

The ed editor was chosen as the model for this new tool, which was named sed (stream editor), and which you may have heard of.

SED

The sed utility retains a lot of the simplicity of the commands it inherited from ed, but it adds a lot more functionality. Its command line or script can look like bewildering gibberish when you first encounter it:

sed -n -e 's/M/ MegaBytes/;s/-.\{12\}\(.\.. MegaBytes\) \([0-9]\{4\}\)-\([0-9][0-9]\)-\([0-9][0-9]\) ..:.. \(.*\)$/\4\/\3\/\2 \1 \5/p' sed-demo

If we break down this gibberish into manageable phrases, then it becomes more comprehensible. It really does, trust me. Now before you throw your hands in the air and say "This is not for me!" let me say that it is very unlikely that you would ever need to construct such a complicated command.

Here's something a lot simpler that is actually useful. Many Linux users also use MS Windows, but if you try to read a Linux-created text file in Windows, you find that the line breaks don't work and each line runs on to the full width allowed by the editor, probably Notepad. This is because Linux terminates its lines with a newline character, \n, while DOS and Windows need a carriage return and newline pair, \r\n (just like the old typewriters, where you push the carriage back to the beginning of the line and move the paper up a line). A newline on its own is not recognized as a line termination. Sed makes light work of this.

sed 's/$/\r/' linux-file > dos-file makes Linux files DOS-readable.

sed 's/\r//' dos-file > linux-file converts them back, although this usually isn't necessary, as most Linux text editors will cope with the extra carriage returns.

It would be a trivial matter to put these two commands into scripts and create a couple of aliases to them, perhaps l2d and d2l.
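As a rough sketch of that idea, the two conversions could also be wrapped up as shell functions rather than scripts and aliases. The names l2d and d2l come from the text above; the function bodies are only an illustration and assume GNU sed, which understands \r. The d2l version anchors the carriage return to the end of the line, a small variation on the command shown earlier.

```shell
# l2d: Linux to DOS. Append a carriage return to the end of every line.
# Assumes GNU sed, which understands \r in the replacement text.
l2d() {
    sed 's/$/\r/' "$@"
}

# d2l: DOS to Linux. Remove a carriage return from the end of every line.
d2l() {
    sed 's/\r$//' "$@"
}

# Both read the named file, or stdin when no file name is given:
#   l2d linux-file > dos-file
#   d2l dos-file > linux-file
```

Placed in ~/.bashrc, the functions would be available in every new terminal session.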

The syntax of sed is very simple:

sed {options} {commands} {file}

options

The most useful options available for GNU sed, which is the version that Linux users will most likely find that they have installed, are:

-e tells sed that the argument that follows should be treated as a command to be applied to the current line. It can be left out when only a single command is given on the command line, but it is required when the option is repeated to apply multiple commands to the input data, or when it is combined with -f.

-n silent mode, don't automatically print the lines to stdout.

-f script add the contents of the named script to the commands to be executed

-r use the extended set of regular expressions (like egrep)

--posix disable GNU extensions. This makes scripts portable to systems that have the standard Unix-like version of sed

--help covers all the options for the version that you are using.
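The options are easier to get a feel for on a tiny stream. This is only a sketch: the three input words, the variable names and the temporary script file are made up for the illustration.

```shell
# A tiny three-line input stream to experiment with.
input=$(printf 'alpha\nbeta\ngamma\n')

# One command: -e is optional here, so both forms give the same result.
one=$(printf '%s\n' "$input" | sed 's/a/A/')
same=$(printf '%s\n' "$input" | sed -e 's/a/A/')

# Two commands: repeat -e once per command; they are applied in order.
two=$(printf '%s\n' "$input" | sed -e 's/a/A/' -e '/^b/d')

# -n suppresses automatic printing, so only lines printed with p appear.
quiet=$(printf '%s\n' "$input" | sed -n -e '/^b/p')

# -f reads the commands from a script file (a temporary file here).
script=$(mktemp)
printf 's/a/A/\n/^b/d\n' > "$script"
from_file=$(printf '%s\n' "$input" | sed -f "$script")
rm -f "$script"

printf '%s\n' "$two"
```

The last line prints Alpha and gAmma: the first a on each line was changed, and the line beginning with b was deleted.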

commands

These define what you do to the data as the stream passes through, and I will describe the use of the most important ones in the body of this text.

file

This is the input data stream, and if the file name is supplied on the command line, it is treated as stdin. That is, sed command file and sed command < file mean the same thing. The input may also be piped in to the command, e.g. cat file | sed command or ls -l | sed command.
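A quick way to convince yourself that the three forms are equivalent is to run the same command each way and compare the results. The two-line temporary file and the variable names below are assumptions made for the sketch.

```shell
# A small temporary file to stand in for real input.
f=$(mktemp)
printf 'total 31M\n-rw-r--r-- 1 111K 2010-04-03 15:11 config\n' > "$f"

# The same command, fed in three different ways.
as_argument=$(sed '/total/d' "$f")
as_redirect=$(sed '/total/d' < "$f")
as_pipe=$(cat "$f" | sed '/total/d')
rm -f "$f"

# All three variables now hold the same single line.
printf '%s\n' "$as_argument" "$as_redirect" "$as_pipe"
```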

To get started, find or create a file to play around with. I have used a short listing of my /boot directory, in a file named sed-demo, ls -AlLGgh /boot >sed-demo, which looks like this:

total 31M
-rw-rw-r-- 1 440 2010-04-02 10:59 boot.backup.sda
-rw-r--r-- 1 111K 2010-04-03 15:11 config
-rw-r--r-- 1 108K 2010-03-16 15:11 config-2.6.32.10-pclos2.pae
-rw-r--r-- 1 111K 2010-04-03 15:11 config-2.6.33.2-pclos1.pae
-rwxr-xr-x 1 579K 2010-04-02 10:59 gfxmenu*
drwxr-xr-x 2 4.0K 2010-04-05 04:11 grub/
-rw------- 1 6.4M 2010-04-02 11:35 initrd-2.6.32.10-pclos2.pae.img
-rw------- 1 6.4M 2010-04-04 08:56 initrd-2.6.33.2-pclos1.pae.img
-rw------- 1 6.4M 2010-04-04 08:56 initrd.img
-rw-r--r-- 1 1.5K 2010-04-10 14:04 kernel.h
-rw-r--r-- 1 1.5K 2010-04-02 11:59 kernel.h-2.6.32.10-pclos2.pae
-rw-r--r-- 1 1.5K 2010-04-10 14:04 kernel.h-2.6.33.2-pclos1.pae
-rw-r--r-- 1 249K 2006-11-05 23:23 message-graphic
-rw-r--r-- 1 1.4M 2010-04-03 15:11 System.map
-rw-r--r-- 1 1.4M 2010-03-16 15:11 System.map-2.6.32.10-pclos2.pae
-rw-r--r-- 1 1.4M 2010-04-03 15:11 System.map-2.6.33.2-pclos1.pae
-rw-rw-r-- 1 256 2010-04-02 10:59 uk-latin1.klt
-rw-r--r-- 1 2.0M 2010-04-03 15:11 vmlinuz
-rw-r--r-- 1 2.0M 2010-03-16 15:11 vmlinuz-2.6.32.10-pclos2.pae
-rw-r--r-- 1 2.0M 2010-04-03 15:11 vmlinuz-2.6.33.2-pclos1.pae

This file contains a mixture of lines of varying length, and fields of differing construction. To select only data that meets certain criteria, and to re-format parts of it to more accurately meet my requirements, would be very difficult without a utility like sed.

The changes I want to make to this set of data are:

Remove the total count
Keep only regular files, no links, directories etc.
Remove the permissions fields
Remove the link counts
Keep only lines that contain files of 1MB or larger
Change 'M' to 'MegaBytes'
Change the date format from year-month-day to day/month/year
Remove the time field
Output the date, size and file name, in that order.

Now that looks like a lot of work, but thanks to the flexibility of sed, I can do it in one command.

To get rid of the line 'total 31M' and leave only the lines with file details, I could issue the following command:

sed -e '/total/d' sed-demo

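This is the beginning of the output from this command:

-rw-rw-r-- 1 440 2010-04-02 10:59 boot.backup.sda
-rw-r--r-- 1 111K 2010-04-03 15:11 config
-rw-r--r-- 1 108K 2010-03-16 15:11 config-2.6.32.10-pclos2.pae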

The line at the start of the listing that contained the expression total has disappeared from the output.

So what did I do here? I issued the sed command with the -e option, which told sed to treat the next command line argument, '/total/d', as a command to apply to the input file sed-demo.

What sed did was to read the sed-demo file, one line at a time, into an area of memory known as pattern space, and examine each line to see whether it matched the regular expression total, which is surrounded by a pair of slashes. Whenever a match was found, sed applied the d command, which deletes the current line from pattern space. This results in no output from sed for that line. Lines that do not contain a pattern match are unaffected and flow through the command to stdout, which in this case is the terminal, as output has not been redirected elsewhere.
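The whole cycle can be tried on a cut-down copy of the listing. This is a sketch only: the four lines are taken from the sed-demo file above, while the temporary file and variable names are assumptions for the illustration.

```shell
# A four-line extract of the listing, written to a temporary file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
total 31M
-rw-rw-r-- 1 440 2010-04-02 10:59 boot.backup.sda
-rw-r--r-- 1 111K 2010-04-03 15:11 config
drwxr-xr-x 2 4.0K 2010-04-05 04:11 grub/
EOF

# Delete every line that matches the regular expression "total".
result=$(sed -e '/total/d' "$demo")
printf '%s\n' "$result"

# The input file itself is untouched and still has all four lines.
lines_in_file=$(wc -l < "$demo")
rm -f "$demo"
```

Three lines reach the terminal, while the file on disk keeps its total line.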

While that simple example of sed usage is not difficult to follow, the key phrase here is 'regular expression,' and a good understanding of regular expressions is required to make effective use of this command.

We covered the basics of regular expressions when we discussed the grep command, so perhaps a refresher is in order.

A regular expression is a sequence of literal characters and meta-characters. Literal characters are treated exactly as they are written and are case sensitive. Meta-characters have a special meaning in regular expressions, and must be expanded to produce the search pattern from the regular expression. These are the basic meta-characters:

.: The dot character matches any single character.
*: The asterisk matches zero or more occurrences of the preceding character. This is not the same behavior as the shell wild-card character.
^: The caret is a tricky one, it has two meanings. Outside of square brackets it means match the pattern only when it occurs at the beginning of the line, this is known as an anchor mark. As the first character inside a pair of brackets it negates the match i.e. match anything except what follows.
$: Another anchor mark this time meaning to only match the pattern at the end of a line.
\< \>: More anchor marks. They match a pattern at the beginning \< or the end \> of a word.
\: The backslash is known as the escape or quoting character and is used to remove the special meaning of the character that immediately follows it.
[ ]: Brackets are used to hold groups, ranges and classes of characters. Here a group can be an individual character.
\{n\}: Match n occurrences of the preceding character or regular expression. Note that n may be a single number \{2\}, a range \{2,4\} or a minimum number \{2,\} meaning at least two occurrences.
\( \): Any matched text between \( and \) is stored in a special temporary buffer. Up to nine such sequences can be saved and then later inserted using the back-references \1 to \9.

In the previous example, the regular expression we used, total, contained only literal characters. But more usually, you will build up a regular expression from literals, meta-characters and character classes such as [:digit:] or [:space:]. The use of meta-characters in regular expressions enables you to very quickly match quite complicated or unknown patterns. Some examples:
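The meta-characters listed above can each be tried against a short piece of text. In this sketch, matches is a hypothetical helper written just for the illustration; it succeeds when sed finds the regular expression in the text. The word anchors \< and \> are assumed to be available, as they are in GNU sed.

```shell
# matches REGEX TEXT: succeed when the regular expression matches the text.
# (A made-up helper for this illustration; the regex must not contain a /.)
matches() {
    [ -n "$(printf '%s\n' "$2" | sed -n "/$1/p")" ]
}

matches 'c.t'        'cat'        && echo 'dot: matched any single character'
matches 'ab*c'       'ac'         && echo 'asterisk: matched zero occurrences of b'
matches '^total'     'total 31M'  && echo 'caret: matched at the start of the line'
matches 'img$'       'initrd.img' && echo 'dollar: matched at the end of the line'
matches '\<map\>'    'System map' && echo 'word anchors: matched a whole word'
matches '[^0-9]'     '2010-04-03' && echo 'negated brackets: matched a non-digit'
matches '[0-9]\{4\}' '2010-04-03' && echo 'braces: matched exactly four digits'

# Saved text: \( \) stores each word, and \1 and \2 put them back swapped.
swapped=$(printf 'hello world\n' | sed 's/\(hello\) \(world\)/\2 \1/')
echo "$swapped"
```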

sed -e '/^#/d' .bashrc would strip out any comments from your .bashrc file, as comments begin with a #.

sed -e '/^$/d' .bashrc would remove any blank lines by matching the beginning and end of the line with nothing in between.
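The two commands combine naturally. The sketch below runs them against a made-up stand-in for .bashrc, so nothing of yours is read at all. Note that /^#/ only catches comments that start in the first column.

```shell
# A made-up stand-in for a .bashrc file.
rc=$(mktemp)
cat > "$rc" <<'EOF'
# set the prompt
PS1='> '

# handy alias
alias ll='ls -l'
EOF

# Remove comment lines and blank lines in a single pass.
cleaned=$(sed -e '/^#/d' -e '/^$/d' "$rc")
rm -f "$rc"

printf '%s\n' "$cleaned"
```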

It is quite safe to try these out since the source file is not altered. Only the output to the terminal is changed.

In my test file, I have one directory, /grub, and this is denoted by the letter d at the beginning of the line. To remove the line, we can use sed's delete command with a regular expression that matches only that line.

sed -e '/^d/d' sed-demo matches all lines beginning, that's the ^, with d, and applies the delete command. The command is single quoted to prevent shell expansion of meta characters. Recall that single quotes are known as strong quotes, and protect the contents from the effects of shell expansion, which wouldn't have had any effect here, but it is a good habit to get into.

To keep directories and remove all other lines, we need to reverse the effect of the command, which we can do with:

sed -n -e '/^d/p' sed-demo

The -n turns off automatic echoing of pattern space to the terminal, and the p command, on finding a pattern match, prints the current contents of pattern space to stdout which, as the output has not been re-directed, is the terminal.

Alternatively we could look for lines that begin with a hyphen, and that would also exclude anything that wasn't a regular file.

sed -n -e '/^-/p' sed-demo

The directory line has been removed but so has the total line as that also did not begin with a hyphen. In this case it helps, but we have to be extremely careful about what we want to include or exclude. Similarly, we can remove all lines that do not contain an uppercase M followed by a space to keep only files of one MB or larger. Without the space, the total line would be included, as that also contains an uppercase M but no trailing space.

sed -n -e '/M /p' sed-demo
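The three selections can be compared side by side on a shortened copy of the listing. The five lines come from sed-demo; the temporary file and the variable names are assumptions for this sketch.

```shell
# Five lines from the listing: the total, two small entries, two large files.
demo=$(mktemp)
cat > "$demo" <<'EOF'
total 31M
-rw-r--r-- 1 111K 2010-04-03 15:11 config
drwxr-xr-x 2 4.0K 2010-04-05 04:11 grub/
-rw------- 1 6.4M 2010-04-04 08:56 initrd.img
-rw-r--r-- 1 2.0M 2010-04-03 15:11 vmlinuz
EOF

dirs=$(sed -n -e '/^d/p' "$demo")    # only lines that begin with d
files=$(sed -n -e '/^-/p' "$demo")   # only lines that begin with a hyphen
big=$(sed -n -e '/M /p' "$demo")     # only lines with an M followed by a space
rm -f "$demo"

printf '%s\n' "$big"
```

Only the initrd.img and vmlinuz lines are printed; the total line is left out because its M is the last character on the line.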

In the first example, I matched the pattern total to remove the first line, but I could more easily have specified an address.

sed -e '1d' sed-demo removes the first line; the 1 is the line number that I want to remove. Addresses can be ranges, so sed -e '8,20d' sed-demo will remove lines 8 to 20 from the output.

Notice that the total line and the directory line are still in the output, as the original data has not been altered.

In this case, I knew the address was 1, but usually you have to search for it. You do this by specifying a regular expression surrounded by slashes. The address of the line to delete in the first example was given by matching the regular expression /total/.
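Numbered input makes the effect of each kind of address easy to see. This sketch uses seq to generate the numbers 1 to 10, one per line; the variable names are made up for the illustration.

```shell
# A single line number: delete line 1 only.
without_first=$(seq 1 10 | sed -e '1d' | paste -sd' ' -)

# A range of line numbers: delete lines 3 to 8.
without_range=$(seq 1 10 | sed -e '3,8d' | paste -sd' ' -)

# A regular expression as the address: print lines that begin with 1.
starts_with_1=$(seq 1 10 | sed -n -e '/^1/p' | paste -sd' ' -)

echo "$without_first"
echo "$without_range"
echo "$starts_with_1"
```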

Substitution

Now that we have a means of keeping only those lines that we want in our final data set, we need to change some of that data. Probably the most used command for sed is s, which substitutes new text for whatever a regular expression matches. The format for this is:

sed -e 's/old/new/' {file}

So that the command:

sed -e 's/M/ MegaBytes/' sed-demo

would change the first uppercase M on each line to ' MegaBytes' (note the preceding space) in my test file. By default, sed substitutes only the first occurrence of the pattern on each line. If you need to change every occurrence, which is often what you want, then you have to add the g (global) flag:

sed -n -e 's/r/R/p' sed-demo would only replace the first r with R.

sed -n -e 's/r/R/gp' sed-demo replaces every occurrence.

Combining two commands we can make a substitution and output only the lines that we want to keep.

sed -n -e 's/M/ MegaBytes/' -e '/Mega/p' sed-demo
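Here are the substitutions applied to single lines, so that the effect of each is easy to see. The lines are taken from the listing; the variable names are assumptions for this sketch.

```shell
line='-rw-r--r-- 1 1.4M 2010-04-03 15:11 System.map'

# Without g, only the first r on the line is changed.
first_only=$(printf '%s\n' "$line" | sed -e 's/r/R/')

# With g, every r on the line is changed.
every=$(printf '%s\n' "$line" | sed -e 's/r/R/g')

# Substitute, then print only the lines that now contain Mega.
small='-rw-r--r-- 1 111K 2010-04-03 15:11 config'
kept=$(printf '%s\n' "$small" "$line" |
    sed -n -e 's/M/ MegaBytes/' -e '/Mega/p')

printf '%s\n' "$first_only" "$every" "$kept"
```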

The substitute command can also be used to good effect to delete a part of the line. To remove the time field, we could match two characters followed by a colon followed by two characters, and replace the match with nothing, like this:

sed -e 's/..:..//' sed-demo

To remove the permissions, the link count and trailing space at the start of the line, we could match a hyphen followed by exactly 12 characters by using the dot, \{ and \} metacharacters.

sed -e 's/-.\{12\}//' sed-demo

The dot matches any character, and the number inside the escaped braces tells the command how many matches to make. In other words, match exactly 12 characters.
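Both deletions can be checked against one line of the listing. This is only a sketch with made-up variable names. Notice that removing the time leaves behind the two spaces that used to surround it.

```shell
line='-rw-r--r-- 1 111K 2010-04-03 15:11 config'

# Remove the time: two characters, a colon, two characters.
no_time=$(printf '%s\n' "$line" | sed -e 's/..:..//')

# Remove the permissions and link count: a hyphen plus exactly 12 characters.
no_perms=$(printf '%s\n' "$line" | sed -e 's/-.\{12\}//')

printf '%s\n' "$no_time" "$no_perms"
```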

One thing to be aware of when using regular expressions with meta characters is that they are greedy. They will always try to match the longest possible string.

If you try to remove the permissions with a command like this:

sed -n -e 's/-.*-//p' sed-demo

looking for a hyphen followed by some characters followed by a hyphen, then you might be disappointed to see that it matched strings like this:

-rw-r--r-- 1 111K 2010-04-03 15:11 config-2.6.33.2-

and output only

pclos1.pae.
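The greedy match is easy to reproduce on that one line. The second command is not from the text above; it is one common way to rein in a match, by asking for characters that are not a space instead of any character.

```shell
line='-rw-r--r-- 1 111K 2010-04-03 15:11 config-2.6.33.2-pclos1.pae'

# Greedy: .* runs from the first hyphen all the way to the last one.
greedy=$(printf '%s\n' "$line" | sed -n -e 's/-.*-//p')

# A tighter pattern (an alternative, not the command used in the text):
# a leading hyphen, any run of non-space characters, then one space.
tight=$(printf '%s\n' "$line" | sed -e 's/^-[^ ]* //')

printf '%s\n' "$greedy" "$tight"
```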

Putting all this together makes for a pretty long command line, so I have used the shell line continuation character, the backslash, to make it more legible. But remember that it is all one line, as far as the shell is concerned.
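To show what line continuation looks like, here is a sketch that strings together only the commands covered so far. It is not the command that the paragraph above refers to, and the shortened listing, temporary file and variable names are assumptions for the illustration.

```shell
# A shortened copy of the listing to work on.
demo=$(mktemp)
cat > "$demo" <<'EOF'
total 31M
-rw-r--r-- 1 111K 2010-04-03 15:11 config
drwxr-xr-x 2 4.0K 2010-04-05 04:11 grub/
-rw------- 1 6.4M 2010-04-04 08:56 initrd.img
-rw-r--r-- 1 2.0M 2010-04-03 15:11 vmlinuz
EOF

# Each backslash at the end of a line tells the shell that the command
# carries on; the shell sees the whole thing as one line.
result=$(sed -n \
    -e '/total/d' \
    -e '/^d/d' \
    -e 's/..:..//' \
    -e 's/-.\{12\}//' \
    -e 's/M/ MegaBytes/p' \
    "$demo")
rm -f "$demo"

printf '%s\n' "$result"
```

The p flag on the last substitution prints a line only when the substitution succeeded, so files smaller than 1MB drop out without a separate command.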

The skill in using sed is recognizing what you want to match, and the building of a regular expression that matches that part of the line, and only that part. This comes with practice and an understanding of regular expressions. Matching the file name is quite tricky, as there seems to be no 'standard format' that could be easily matched. So the easiest way out is to match everything else.

sed -e 's/-.\{34\}//' sed-demo

Here's the start of the output.

No, I didn't count all 34 characters. I took a guess, tried it and adjusted it. This trial and error method is quite common when building regular expressions, although not everyone admits it.

If you noticed that the total line is in the output, it is because it doesn't begin with a hyphen and we hadn't already removed it. The order of operation of your commands can have a great effect on the resultant output.

sed -e 's/[0-9]\{4\}-[0-9][0-9]-[0-9][0-9]//' sed-demo

matches the date part of the line and removes it. Here's how it works.

The first slash starts the search expression.

[0-9]\{4\}: matches exactly 4 digits.
-: matches a literal hyphen.
[0-9][0-9]: matches 2 digits.
-: matches a literal hyphen.
[0-9][0-9]: matches 2 digits.
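The same pattern can do more than delete. Wrapping each part in \( and \) saves it, so the pieces can be put back in a different order, which is one way to get from year-month-day to day/month/year. This is a sketch on a single line of the listing, with made-up variable names.

```shell
line='-rw-r--r-- 1 2.0M 2010-03-16 15:11 vmlinuz-2.6.32.10-pclos2.pae'

# Delete the date, exactly as in the command above.
no_date=$(printf '%s\n' "$line" |
    sed -e 's/[0-9]\{4\}-[0-9][0-9]-[0-9][0-9]//')

# Save year, month and day as \1, \2 and \3, then write them back as
# day/month/year. The slashes in the replacement must be escaped.
dmy=$(printf '%s\n' "$line" |
    sed -e 's/\([0-9]\{4\}\)-\([0-9][0-9]\)-\([0-9][0-9]\)/\3\/\2\/\1/')

printf '%s\n' "$no_date" "$dmy"
```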
