COP 2344 (Shell Scripting) Mini-Tutorial:
Variables, Control Structures, and Functions for Non-Programmers

 

Written 3/2009 by Wayne Pollock, Tampa Florida USA

Understanding Variables:

A variable is a named location in memory that holds data.  This is useful since it is common to store user input or the results of calculations, so they can be referred to later.  Every location in the memory of a computer is numbered (the RAM address).  To look at some previously stored data you would need to know the memory address (and the length of the data) where you put it.  It can be painful to try to remember what bit of data you put into location 223,506,080.

Instead we tell the computer the name of some data (such as a password a user entered).  Humans don't care where in memory this data is stored; we can let the computer handle that.  Later we can refer to that location by the name we picked.  You would retrieve the user's password from the location named password.  This named location in memory is a variable.  The data stored in that location is called the variable's value.

While the exact details of using variables can differ slightly from one scripting language to the next, the basic idea of a named location in memory is the same.  A variable can only hold one piece of data (value) at a time.  If you put a new piece of data into some variable, whatever data was there before is gone.

When picking names for your variables it pays to pick descriptive names.  Most languages have rules and conventions for variable names.  A common rule is that the names must start with a letter and consist of letters and digits only.  (Usually an underscore _ counts as a letter.)  Also you can't (and shouldn't even if some language would allow it) use reserved words for variable names, such as if or while.

There are also naming conventions you should follow.  Shell variables are typically all uppercase letters, for example:  TOTAL, SUB_TOTAL, SALES_TAX, etc.  The convention used with awk is to use all lowercase letters, with an underscore to separate words:  total, sub_total, sales_tax, etc.  (Some languages prefer to avoid using an underscore and instead use camel-case:  total, subTotal, salesTax, numStudentsPerClass, etc.)

Some computer languages require you to declare your variables before you can use them.  This means to tell the computer the name of the variable before you begin to use it.  Few scripting languages require this though.

Some computer languages only allow a variable to hold a single type of data; that is, the variable foo can hold either numbers or text, but you must decide which when you declare the variable.  (Sometimes there is a difference made between integers (whole numbers such as 0, 16, -1, etc.) and floating point values (such as 3.14).)  Again, this is rare for scripting languages.  Most of the time the value of some variable will be interpreted as a number or as text depending on what you try to do with that value.  (For example if you add 2 to some variable X as in Y=X+2, X will be interpreted as a number.  If you try to print out the value of a variable it is likely to be interpreted as text.)

Some computer languages allow a variable to have attributes such as read-only, in addition to its value.  (Shell variables can have attributes.)
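As a rough sketch of the read-only attribute in a POSIX shell (the variable names SALES_TAX_RATE and CHANGED are made up for this illustration), the attempt to change the value is made in a sub-shell so that a failure cannot stop the rest of the script:

```shell
SALES_TAX_RATE=7
readonly SALES_TAX_RATE      # give the variable the read-only attribute

# Try to change the value inside a sub-shell; the attempt should fail.
if ( SALES_TAX_RATE=9 ) 2>/dev/null
then CHANGED="yes"
else CHANGED="no"
fi
echo "Changed: $CHANGED; the rate is still $SALES_TAX_RATE"
```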

Finally, some computer languages allow variables to hold a list of values, rather than a single value.  Such a variable is often called an array.  (Regular variables that hold a single value only are known as scalar variables.)  With an array variable, the name refers to the whole list.  To access some particular value in that list, you supply an index.  For example, if STUDENT is an array of student names, STUDENT[1] might refer to the first name, STUDENT[2] to the second, and so on.  (In many languages the first value in the array has an index of zero, not one.)  While most languages require an integer for an index, awk uses a string of text.  If you supply a number for an index awk will convert it to a string for you.  This feature makes it easy to use code such as days_in_month["January"] = 31, but it is harder to iterate (loop over) each value in the array.
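Here is a small sketch of an awk array indexed by text, run from the shell.  The shell variable DAYS and the February entry are additions made for this illustration.  The for ( name in array ) form visits every index, but in no guaranteed order, which is fine when you only want a sum:

```shell
DAYS=$(awk 'BEGIN {
    days_in_month["January"] = 31
    days_in_month["February"] = 28
    # Visit each index (month name) in the array; the order is unspecified.
    for ( month in days_in_month )
        total = total + days_in_month[month]
    print total
}')
echo "$DAYS"
```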

The shell has variables with attributes.  These don't need to be declared.  They can hold a string of text, but if you do arithmetic using these variables the shell will attempt to convert the text to a number first.  Numeric values are limited to integers.  (Of course some shells have extensions to support floating point, arrays, etc.)  To set a value for some variable is called assignment, and is easy to do.  For the shell you can assign a value to a variable this way:

NAME=VALUE

(With no spaces around the equals sign.)  To recall the value of a variable, you use the name with a dollar-sign in front:

echo $NAME
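A minimal sketch of assignment and recall in a POSIX shell follows; the names and numbers are invented for illustration only:

```shell
# Assign values; note there are no spaces around the equals sign.
SUB_TOTAL=100
SALES_TAX=8

# Recall values with a dollar sign; $(( )) does integer arithmetic.
TOTAL=$((SUB_TOTAL + SALES_TAX))
echo "The total is $TOTAL"

# Assigning again replaces the old value; the 100 is gone.
SUB_TOTAL=250
echo "The new subtotal is $SUB_TOTAL"
```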

In awk variables are even easier to use.  You can have spaces around the equals-sign when assigning a value to one, and no dollar-sign is needed to recall the value:

total = subtotal + tax
print "the total is " total

One last point:  When assigning a value to a variable, the variable name always goes on the left-hand side of the equals sign.  The right-hand side is any expression, which may be just a variable's value or a complex calculation.  The assignment statement always works the same way; first the value of the expression on the right is calculated.  Then that value is stored in the variable named on the left.  This sometimes confuses non-programmers when they see assignment statements such as this:

COUNTER = COUNTER + 1
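Read such a statement as "work out COUNTER plus one, then store the result back in COUNTER", not as a claim that the two sides are equal.  A sketch of the same idea in POSIX shell syntax (the starting value 5 is arbitrary):

```shell
COUNTER=5
# The right-hand side is calculated first (5 + 1); the result is then
# stored back in COUNTER, replacing the old value.
COUNTER=$((COUNTER + 1))
echo "$COUNTER"
```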

Understanding Control Structures (if statements and loops):

A human is smarter than any computer.  When a human performs a series of steps in some procedure, such as following a recipe to bake a cake, a human can adjust the procedure when necessary.  (For example if you observe the batter is flying out of the mixing bowl you can slow down the electric mixer speed, even though that isn't mentioned in the recipe.)

Sadly a computer will always blindly follow the procedure (or script) when carrying out a sequence of steps.  The normal sequence, called the flow of control, is sequential: do one step and when finished with that, do the next.  If one step results in "file not found" or "disk full" the next step is still attempted.  The result is very much like a kitchen with the cake batter spattered all over the walls and ceiling: a big mess.

A control structure allows a program to alter the sequential flow of control (the sequence of steps it performs).  The human script writer (or programmer, but I don't want to scare you by using that term too often :-) can anticipate possible problems and test for them.  The script can then carry out different sequences of steps depending on the result of the test.

The control structure can perform or skip either a single step (or statement) or a whole set of steps.  The set of steps is called a block.  While most languages use curly braces ({ and }) to indicate the beginning and end of a block, others use special keywords (reserved words) instead.

There are no restrictions on what kind of statements or how many statements can be used in a block.  And since a whole control structure is just a complex statement as far as the computer is concerned, the block of one control structure can contain other control structures; this is called nesting.

There are two basic types of control structures: selections and loops.

Using Selection Statements

A selection carries out one of several steps, depending on the result of a test.  The simplest control structure is the if statement:

total = subtotal + tax
if ( total > 1000 )
then   total = total - discount
print total

Here the code checks if the total qualifies for a discount, and if so then applies a discount.  If not then that step is skipped.  The statement that is done or skipped is the body.  In this case we call the body the then statement(s).

The above code won't work for AWK; you don't use a then keyword.  In most languages you need to use a variation of this if statement, but the idea is the same.  For example in a shell script the if statement would look like this:

total=$((subtotal + tax))
if test "$total" -gt 1000
then   total=$((total - discount))
fi

The body can be a single statement as shown here, but can also be a sequence of statements.  As mentioned previously such a group of statements is known as a block statement.  In AWK a block statement is delimited by curly braces.  In shell you just put as many statements as you want between the then and the fi keywords; no curly braces are used.

The if statement shown above, using a block for the body, would look like this in AWK:

total = subtotal + tax
if ( total > 1000 )
{
   total = total - discount
}
print total
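For comparison, a shell body with more than one statement might look like the following sketch.  The starting values are made up for illustration; everything between then and fi is the body:

```shell
subtotal=1200
tax=96
discount=50

total=$((subtotal + tax))
if test "$total" -gt 1000
then
    # Both of these statements belong to the body of the if.
    total=$((total - discount))
    echo "Discount applied"
fi
echo "$total"
```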

There are no restrictions on the number or type of statements that can appear in the body.  You can even have other if statements in the body.  This is called nested if statements.

To make your script more readable you should use plenty of spaces and blank lines, much like a novel uses extra space around paragraphs.  A common convention is to indent each line of a block by 3 or 4 spaces (try to avoid using a TAB character).  For example in awk the following are all legal but which style do you find the most readable?

total = whatever
if ( total > 1000 ) total = total - discount
print total

Or:

total = whatever
if ( total > 1000 )
   total = total - discount
print total

Or:

total = whatever
if ( total > 1000 ) {
   total = total - discount
}
print total

There are other selections, such as the two-way:

generate_report in file foo
if no file generated   
#  that is, something went wrong
then   display error message on console
else   print report

This is called an if-then-else statement.  It has two bodies (or clauses), the then statements and the else statements.  As usual the body is either a single statement or a block statement.

When the condition (the test) evaluates to true, the then statements are done and the else statements are skipped.  On the other hand, if the condition evaluates to false the then statements are skipped and the else statements are done.
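A concrete shell version of that two-way selection could look something like this sketch.  REPORT_FILE and STATUS are hypothetical names, and the file name is chosen so that the file will not exist, which makes the else statements run:

```shell
# REPORT_FILE is a made-up file name used only for this illustration.
REPORT_FILE="/tmp/no_such_report_$$"

if test -f "$REPORT_FILE"
then
    STATUS="printing report"
else
    STATUS="error: no report was generated"
fi
echo "$STATUS"
```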

Most languages also provide multi-way selection statements, where only one of many alternative bodies will be done.  However such control structures aren't needed since you can achieve the same effect with nested if statements, arranged in what is called an if ladder or an if chain.  Here's an example in AWK (note the indenting style used):

if ( score >= 90 )
   print "You got an 'A'."
else if ( score >= 80 )
   print "You got a 'B'."
else if ( score >= 70 )
   print "You got a 'C'."
else if ( score >= 64 )
   print "You got a 'D'."
else
   print "You got an 'F'."
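The same ladder can be sketched in shell, where the keyword elif takes the place of else if.  The score of 85 and the variable name grade are assumptions made for this illustration; the cutoffs are copied from the awk example:

```shell
score=85

if test "$score" -ge 90
then grade="A"
elif test "$score" -ge 80
then grade="B"
elif test "$score" -ge 70
then grade="C"
elif test "$score" -ge 64
then grade="D"
else grade="F"
fi
echo "Your grade: $grade"
```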

The exact details (or syntax) of these statements depend on the computer language you're using.  Mostly you provide some expression that gets evaluated to a true or false value.  This is called the condition expression, or the test expression.  If true, the first set of statements (called the then statements) is done.  If false, that first set is skipped; if a second set is provided (called the else statements) those will be done instead.

Since shell will be discussed later in the course, here's some sample awk code showing an if statement in some action.  See if you can spot the condition expression, the then statements, and the else statements (remember in awk the int function truncates a number to an integer by dropping any fraction):

total = ...
printf "The total is "
if ( total < 1.00 ) {
   cents = int( total * 100 )
   print cents " cents"
} else {
   dollars = int( total )
   cents = int( (total - dollars) * 100 )
   print dollars " dollars and " cents " cents"
}

Using Loops

A loop is a set of statements that you run through more than once.  The set of statements is called the loop body.  In many ways loops are similar to selection statements; a condition expression is evaluated.  If true the loop body is executed, and the condition is evaluated again.  This repeats until the condition evaluates to false.

One complete loop cycle is called an iteration.  Programmers often say something such as "this code will iterate over the loop 10 times" to mean ten loop cycles (iterations) will be done.

The simplest loop is a while loop.  In most languages (including awk) it will look like this:

...
printf "Would you like to play a game? "
read answer     # Not really an awk statement
while (answer == "yes" ) {
   play the game
   ...
   printf "Play again? "
   read answer
}

This type of loop is known as a sentinel loop.  With this type of loop, even looking at the program and data you can never be sure if you'll execute the loop body another time.  You just have to check after each iteration of the loop.  Sentinel loops are often used to process data when you don't know how many records there are; you keep looping as long as there is more data to process.  (The name comes from the fact that early computers couldn't safely tell when there was no more data to process, so programmers had to add a special data record, known as the sentinel value, to mark the end of the data.)
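A sentinel loop in shell might look like the following sketch, which counts records until the data runs out.  The three sample records are supplied by a here-document and are invented for this illustration; the loop itself does not know in advance how many there are:

```shell
count=0
# read fails when there is no more data, which ends the loop.
while read -r line
do
    count=$((count + 1))
done <<EOF
first record
second record
third record
EOF
echo "$count"
```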

There are some variations of sentinel loops available in most computer languages.  The while loop shown above does the test first, and only if true executes the loop body.  Another kind of sentinel loop first executes the loop body, then evaluates the test expression; this type of loop will always do the loop body at least once.

Occasionally it can be useful to test in the middle of the loop body, rather than at the top or the bottom.  You can achieve this by using an infinite loop that contains an if statement in the middle.  When the if statement is true, it causes the loop to exit.  For example:

   while ( 1==1 ) {
      printf "enter a number, 0 to exit:"
      read number
      if ( number == 0 )
         break
      do_something_interesting
   }
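In shell the same test-in-the-middle idea can be sketched as follows.  The numbers come from a here-document made up for this illustration, and adding them up stands in for "do something interesting".  The 99 after the 0 is never read:

```shell
sum=0
while true
do
    read -r number || break      # also stop if the input runs out
    if test "$number" -eq 0
    then break                   # the test in the middle of the body
    fi
    sum=$((sum + number))
done <<EOF
4
7
0
99
EOF
echo "$sum"
```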

The other type of loop is called a counting loop.  With this type of loop the loop body is executed a fixed number of times.  There is still a test expression, to check if you've run through the loop body enough times.  (Note it isn't necessary to know when you write the script the number of times to run through, or iterate over, the loop body.  It is still a counting loop as long as you know before running through the loop the first time.  So if you count the number of records in a data file, you could use a counting loop to process all records rather than a sentinel loop.)

Most languages provide a for loop that is useful to implement counting loops.  (While the shell has a for loop, it works differently than the for loop in AWK and other languages.)  The for loop statement looks like this:

for ( initial; test; increment )
    loop body

Each of the three parts is optional and may be omitted.  (However you still need the semi-colons!)  As always the body can be a single statement or a block statement, and can contain any other statements you wish, including other loops or selection statements.  (Remember this is called nested control structures.)

The initial part (if any) is done first.  No matter how many times we go around the loop this part is only done once.

Now comes the real work of the loop.  The condition (test) is evaluated.  (A missing test is always true.)  If the condition evaluates to false, the loop is finished and the loop body is not executed.  If the condition evaluates to true, the loop body is done next.  Finally the increment is done, if present.

The above is one complete iteration of the loop.  When done with one iteration the next is done, starting with the re-evaluation of the condition.  Here's an example in AWK that prints Hello! ten times:

for ( i = 0; i < 10; i = i + 1 )
   print "Hello!"

It is very common to need to add a value to some variable, so many languages have convenient ways to do this.  For example the above increment could have been written as i += 1.  Adding 1 (one) to a variable is even more common and most computer languages have a special operator that does exactly that.  So the increment could be written as ++i or i++.

The variable used to count in a for loop is known as the loop control variable.  While you should always try to pick relevant names for variables, there often isn't a good choice for a loop control variable and i is usually used.  (And if you need more loop control variables in a script, use j, k, and so on).

A for loop doesn't have to count up only, or by ones.  Here are some samples; see if you can figure out what they do:

for ( i = 1; i <= 10; ++i )    print i
for ( i = 10; i > 0; --i )     print i
for ( i = 1; i <= 10; i=i+2 )  print i
for ( i = -5; i <= 5; ++i )    print i
for ( i = 100; i > 0;  i=i-10) print i
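One way to check your answer for any of these is to put the loop inside an awk BEGIN block and run it from the shell.  The sketch below does this for the third sample; RESULT is just a name picked for this illustration.  Try to predict the output before you run it:

```shell
# Run the loop in awk and capture what it prints (one number per line).
RESULT=$(awk 'BEGIN { for ( i = 1; i <= 10; i=i+2 )  print i }')
echo "$RESULT"
```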

A for loop doesn't add anything to the language that you couldn't do with a while loop instead.  The idea is to make programs more readable by using the type of loop that best matches your intent (sentinel or counting loop).  As an example, here's a while loop that prints Hello! ten times:

i = 0
while ( i < 10 ) {
   print "Hello!"
   i = i + 1
}

You can also express a sentinel loop using for; just omit the initial and the increment parts.  But the best idea is to use while for sentinel loops and for for counting loops.

Understanding Functions

A function is a named block of code.  A function may be called other names such as procedure, method, routine, subroutine, or subprogram.  To run the statements in a function you invoke (or call) it by using its name.  Most computer languages come with a number of built-in functions you can use.  You can also define new functions.

Here's a function written in POSIX shell:

$ greet() (
    TIME=$(LC_TIME=POSIX date '+%p') # either 'AM' or 'PM'
    if test "$TIME" = "AM"
    then MESSAGE="morning"
    else MESSAGE="afternoon"
    fi
    echo Good $MESSAGE, $USER!
)
$ greet
Good afternoon, wpollock!
$ 

In shell you can use curly braces instead to define a function, but then the statements will run in the current shell and not a sub-shell, so running greet would set or overwrite TIME and MESSAGE in the current environment.
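The difference can be seen in this small sketch.  The function names greet_sub and greet_cur, and the variables AFTER_SUB and AFTER_CUR, are hypothetical and used only for illustration:

```shell
# Parentheses: the body runs in a sub-shell.
greet_sub() ( MESSAGE="from the sub-shell" )
# Curly braces: the body runs in the current shell.
greet_cur() { MESSAGE="from the current shell"; }

MESSAGE="original"
greet_sub
AFTER_SUB=$MESSAGE       # unchanged: still "original"
greet_cur
AFTER_CUR=$MESSAGE       # overwritten by the function
echo "$AFTER_SUB / $AFTER_CUR"
```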

Functions can be made more useful if you can pass in some data that they should work on.  For example consider a function that calculates the square of some number.  Unless there was some way to pass the number to the function, how would it know which number to square?  One way is to use a variable, something like this:

$ NUMBER_TO_SQUARE=6
$ square
36
$ 

But this is awkward and error-prone.  Instead it would be best to pass in the number to square as a parameter, exactly the same way you pass parameters to a shell script:

$ square 6
36
$ 

While the shell function is running, the parameters passed to the function can be accessed as if they were positional parameters, $1, $2, $*, $#, etc. (but not $0).  So we can write the shell function square like this:

$ square() {
    echo $(( $1 * $1 ))
}
$ square 5
25
$ 

The function ends after the last statement in the block runs.  Functions can be terminated early by using the return statement.  Once the shell function is done, the positional parameters of the shell are restored.
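Both points can be sketched as follows.  The function name safe_square and the words apple and banana are made up for this illustration; the function returns early with a failure status unless it is given exactly one parameter:

```shell
safe_square() {
    if test "$#" -ne 1
    then
        echo "usage: safe_square NUMBER" >&2
        return 1             # leave the function early
    fi
    echo $(( $1 * $1 ))
}

set -- apple banana          # give the script itself two parameters
safe_square 7
echo "$1"                    # the script's own $1 is unchanged
```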

Unlike shell, in most computer languages the parameters are passed inside parentheses, like this:

result = square(6)

And instead of positional parameters, you provide a local variable name to hold the parameter.  (It is local in that the variable can only be used within the function body; that is, locally.)  Another difference from shell is that functions can either print the results (like shell does) or return a value that can be stored in a variable or used in a calculation.

The shell doesn't need to return a value because you can use a shell function with command substitution, like this:
      RESULT=$(square 6)
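Because command substitution turns the printed result into ordinary text, it can also be used inside a calculation.  A short sketch follows; the variable name AREA is an assumption for this illustration, and square is defined as shown earlier:

```shell
square() {
    echo $(( $1 * $1 ))
}

# Add the squares of 6 and 3 by substituting each function's output.
AREA=$(( $(square 6) + $(square 3) ))
echo "$AREA"
```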

AWK has many built-in functions you can use, including many math and string functions.  One of the most useful is rand(), which returns a random number; this is the only POSIX standard way to do this!  Note in AWK (as with most scripting languages) to invoke a function you always follow the function name with parentheses, even when not passing in any parameters.

Here's how you can define and use a function in AWK:

$ awk '
function square ( number ) {
   return number * number;
}
BEGIN { print square(3) }
'
9
$