11 String Processing

 11.1 String concatenation
 11.2 Obtain the length of a string
 11.3 Find the position of a substring
 11.4 Extracting a substring
 11.5 Split a string into an array
 11.6 Changing case
 11.7 String substitution
 11.8 Formatted Printing

The C-shell has no string-manipulation tools. Instead we mostly use the echo command and awk utility. The latter has its own language for text processing and you can write awk scripts to perform complex processing, normally from files. There are even books devoted to awk such as the succinctly titled sed & awk by Dale Dougherty (O’Reilly & Associates, 1990). Naturally enough a detailed description is far beyond the scope of this cookbook.

Other relevant tools include cut, paste, grep and sed. The last two and awk gain much of their power from regular expressions. A regular expression is a pattern of characters used to match the same characters in a search through text. The pattern can include special characters to refine the search. They include ones to anchor to the beginning or end of a line, select a specific number of characters, specify any characters and so on. If you are going to write lots of scripts which manipulate text, learning about regular expressions is time well spent.
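
As a small illustration of the anchoring and counting features just mentioned, here is a sketch using grep. The three object names are made-up input, and the -E option (extended regular expressions) is assumed to be available in your grep. The command is a bare pipeline, so it should behave the same in the C-shell and in Bourne-type shells.

```shell
# Three candidate names, one per line, filtered by a regular expression.
# ^ anchors the match to the start of the line and $ to the end;
# [0-9]{4} demands exactly four digits.
printf 'NGC 2345\nIC 1234\nNGC 45\n' | grep -E '^NGC [0-9]{4}$'
# prints: NGC 2345
```

Only the first name passes: the second fails the ^NGC anchor and the third has too few digits.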

Here we present a few one-line recipes. They can be included inline, but for clarity the values derived via awk are assigned to variables using the set command. Note that these may not be the most concise way of achieving the desired results.

11.1 String concatenation

To concatenate strings, you merely abut them.

       set catal = "NGC "
       set number = 2345
       set object = "Processing $catal$number."

So object is assigned to "Processing NGC 2345.". Note that spaces must be enclosed in quotes. If you want to embed variable substitutions, these should be double quotes as above.

On occasions the variable name is ambiguous. Then you have to break up the string. Suppose you want text to be "File cde1 is not abc". You can either make two abutted strings or encase the variable name in braces ({ }), as shown below.

       set name1 = abc
       set name = cde
       set text = "File $name""1 is not $name1"     # These two assignments
       set text = "File ${name}1 is not $name1"     # are equivalent.

Here are some other examples of string concatenation.

       echo 'CIRCLE ( '$centre[1]', '$centre[2]', 20 )' > $file1".ard"
       gausmooth in=$root"$suffix" out=${root}_sm accept
       linplot ../arc/near_arc"($lbnd":"$ubnd)"

11.2 Obtain the length of a string

This requires either the wc command in an expression (see Section 10), or the awk function length. Here we determine the number of characters in variable object using both recipes.

       set object = "Processing NGC 2345"
       set nchar = `echo $object | awk '{print length($0)}'`      # = 19
       @ nchar = `echo $object | wc -c` - 1

If the variable is an array, you can either obtain the length of the whole array or just an element. For the whole the number of characters is the length of a space-separated list of the elements. The double quotes are delimiters; they are not part of the string so are not counted.

       set places = ( Jupiter "Eagle Nebula" "Gamma quadrant" )
       set nchara = `echo $places | awk '{print length($0)}'`     # = 35
       set nchar1 = `echo $places[1] | awk '{print length($0)}'`  # =  7
       set nchar2 = `echo $places[2] | awk '{print length($0)}'`  # = 12
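
The subtraction in the wc recipe is needed because echo appends a newline, which wc -c counts as a character. The sketch below shows the raw count and one way to correct it. Doing the arithmetic in awk is an alternative to the @ expression, not the recipe above; it has the side benefit of ignoring any leading blanks that some wc implementations print.

```shell
# wc -c includes the trailing newline written by echo: 19 characters + 1.
echo "Processing NGC 2345" | wc -c
# Take the first field of wc's output and subtract the newline.
echo "Processing NGC 2345" | wc -c | awk '{print $1 - 1}'
# prints: 19
```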

11.3 Find the position of a substring

This requires the awk function index. This returns zero if the string could not be located. Note that comparisons are case sensitive.

       set catal = "NGC "
       set number = 2345
       set object = "Processing $catal$number."
       set cind = `echo $object | awk '{print index($0,"ngc")}'`      # =  0
       set cind = `echo $object | awk '{print index($0,"NGC")}'`      # = 12
  
       set places = ( Jupiter "Eagle Nebula" "Gamma quadrant" )
       set cposa = `echo $places | awk '{print index($0,"ebu")}'`     # = 16
       set cposn = `echo $places | awk '{print index($0,"alpha")}'`   # =  0
       set cpos1 = `echo $places[1] | awk '{print index($0,"Owl")}'`  # =  0
       set cpos2 = `echo $places[2] | awk '{print index($0,"ebu")}'`  # =  8
       set cpos3 = `echo $places[3] | awk '{print index($0,"rant")}'` # = 11

An array of strings is treated as a space-separated list of the elements. The double quotes are delimiters; they are not part of the string so are not counted.
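
Because index is case sensitive, one way to search regardless of case is to fold the text to lower case before searching. This is only a sketch: it assumes your awk provides tolower, which Section 11.6 notes is not universal.

```shell
# Fold the input to lower case, then look for a lower-case substring.
echo "Processing NGC 2345." | awk '{print index(tolower($0), "ngc")}'
# prints: 12
```

The position returned still refers to the original string, since changing case does not alter its length.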

11.4 Extracting a substring

One method uses the awk function substr(s,c,n). This returns the substring from string s starting from character position c up to a maximum length of n characters. If n is not supplied, the rest of the string from c is returned. Let’s see it in action.

       set caption = "Processing NGC 2345."
       set object = `echo $caption | awk '{print substr($0,12,8)}'` # = "NGC 2345"
       set objec_ = `echo $caption | awk '{print substr($0,16)}'`   # = "2345."
  
       set places = ( Jupiter "Eagle Nebula" "Gamma quadrant" )
       set oba = `echo $places | awk '{print substr($0,28,4)}'` # = "quad"
       set ob1 = `echo $places[3] | awk '{print substr($0,7)}'` # = "quadrant"

An array of strings is treated as a space-separated list of the elements. The double quotes are delimiters; they are not part of the string so are not counted.
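
When the start position is not known in advance, the index function of Section 11.3 can supply it to substr. The following is a minimal sketch of that combination; the variable p and the guard against a failed search are illustrative choices, not part of the recipes above.

```shell
# Locate "NGC", then extract eight characters starting at that position.
# Nothing is printed if the substring is absent (index returns 0).
echo "Processing NGC 2345." | awk '{p = index($0, "NGC"); if (p > 0) print substr($0, p, 8)}'
# prints: NGC 2345
```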

Another method uses the UNIX cut command. It too can specify a range or ranges of characters. It can also extract fields separated by nominated characters. Here are some examples using the same values for the array places.

       set cut1 = `echo $places | cut -d ' ' -f1,3`  # = "Jupiter Nebula"
       set cut2 = `echo $places[3] | cut -d a -f2`   # = "mm"
       set cut3 = `echo $places | cut -c3,11`        # = "pg"
       set cut4 = `echo $places | cut -c3-11`        # = "piter Eag"
       set cut5 = `cut -d ' ' -f1,3-5 table.dat`     # Extracts fields 1,3,4,5
                                                     # from file table.dat
  

The -d qualifier specifies the delimiter between associated data (otherwise called fields). Note that the space delimiter must be quoted. The -f qualifier selects the fields. You can also select character columns with the -c qualifier. Both -c and -f can comprise a comma-separated list of individual values and/or ranges of values separated by a hyphen. As you might expect, cut can take its input from files too.
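
To see a mixed list of fields and ranges at work on several lines at once, here is a sketch in which printf stands in for a file such as table.dat. The two rows of values are invented purely for illustration.

```shell
# Two space-separated rows; keep field 1 and fields 3 to 5 of each.
printf 'm31 0.71 10.5 3.2 41.3\nm33 0.84 12.1 2.9 30.7\n' | cut -d ' ' -f1,3-5
# prints:
# m31 10.5 3.2 41.3
# m33 12.1 2.9 30.7
```

Note that cut treats every delimiter as significant, so columns padded with several spaces would yield empty fields.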

11.5 Split a string into an array

The awk function split(s,a,sep) splits a string s into an awk array a using the delimiter sep.

       set time = 12:34:56
       set hr = `echo $time | awk '{split($0,a,":"); print a[1]}'` # = 12
       set sec = `echo $time | awk '{split($0,a,":"); print a[3]}'` # = 56
  
       # = 12 34 56
       set hms = `echo $time | awk '{split($0,a,":"); print a[1], a[2], a[3]}'`
       set hms = `echo $time | awk '{split($0,a,":"); for (i=1; i<=3; i++) print a[i]}'`
       set hms = `echo $time | awk 'BEGIN{FS=":"}{for (i=1; i<=NF; i++) print $i}'`

Variable hms is an array so hms[2] is 34. The last three statements are equivalent, but the last two are more convenient for longer arrays. In the second you can specify the start index and number of elements to print. If, however, the number of values can vary and you want all of them to become array elements, then use the final recipe; here you specify the field separator with awk’s FS built-in variable, and the number of values with the NF built-in variable.
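
The split function also returns the number of elements it created, which allows a sanity check before the pieces are used. The sketch below converts the same time to seconds within awk; the variable n and the check for exactly three elements are illustrative assumptions.

```shell
# Split on ":", confirm there are three parts, then combine them.
echo 12:34:56 | awk '{n = split($0, a, ":"); if (n == 3) print a[1]*3600 + a[2]*60 + a[3]}'
# prints: 45296
```

Doing the arithmetic inside awk avoids a second pass through the shell's own expression evaluation.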

11.6 Changing case

Some implementations of awk offer functions to change case.

       set text = "Eta-Aquarid shower"
       set utext = `echo $text | awk '{print toupper($0)}'` # = "ETA-AQUARID SHOWER"
       set ltext = `echo $text | awk '{print tolower($0)}'` # = "eta-aquarid shower"
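
If your awk lacks these functions, the tr command is a possible fallback. This sketch assumes a tr that understands the POSIX character classes [:lower:] and [:upper:]; it is an alternative technique, not the awk recipe above.

```shell
# Translate every lower-case letter to its upper-case counterpart...
echo "Eta-Aquarid shower" | tr '[:lower:]' '[:upper:]'
# prints: ETA-AQUARID SHOWER
# ...and the reverse.
echo "Eta-Aquarid shower" | tr '[:upper:]' '[:lower:]'
# prints: eta-aquarid shower
```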

11.7 String substitution

Some implementations of awk offer the substitution functions gsub(e,s) and sub(e,s). The latter substitutes s for the first match of the regular expression e in the supplied text; the former replaces every match.

       set text = "Eta-Aquarid shower"
       # = "Eta-Aquarid stream"
       set text = `echo $text | awk '{sub("shower","stream"); print $0}'`
  
       # = "Eta-Aquxid strex"
       set text1 = `echo $text | awk '{gsub("a[a-z]","x"); print $0}'`
  
       # = "Eta-Aquaritt stream"
       set text2 = `echo $text | awk '{sub("a*d","tt"); print $0}'`
  
       set name = "Abell 3158"
       set catalogue = `echo $name | awk '{sub("[0-9]+",""); print $0}'`  # = Abell
       set short = `echo $name | awk '{gsub("[b-z]",""); print $0}'`      # = "A 3158"

There is also sed.

       set text = `echo $text | sed 's/shower/stream/'`

is equivalent to the first awk example above. Similarly you could replace all occurrences.

       set text1 = `echo $text | sed 's/a[a-z]/x/g'`

is equivalent to the second example. The final g requests that the substitution is applied to all occurrences.
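
The effect of the final g is easiest to see by running the same substitution with and without it. A minimal sketch on one of the strings used earlier:

```shell
# Without g only the first match on the line is replaced.
echo "Gamma quadrant" | sed 's/a/A/'
# prints: GAmma quadrant
# With g every match is replaced.
echo "Gamma quadrant" | sed 's/a/A/g'
# prints: GAmmA quAdrAnt
```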

11.8 Formatted Printing

A script may process and analyse many datasets, and the results from its calculations will often need presentation, often in tabular form or some aligned output, either for human readability or to be read by some other software.

The UNIX command printf permits formatted output. It is analogous to the C function of the same name. The syntax is

       printf "<format string>" <space-separated argument list of variables>

The format string may contain text, conversion codes, and interpreted sequences.

The conversion codes appear in the same order as the arguments they correspond to. A conversion code has the form

  %[flag][width][.][precision]code

where the items in brackets are optional.

The interpreted sequences include:
\n for a new line, \" for a double quote, and \\ for a backslash. A literal percentage sign is written as %%.
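
To make the anatomy of a conversion code concrete, the sketch below combines a flag, a width and a precision in one command. The label flux and the value 3.14159 are arbitrary, and the vertical bars are there only to make the field edges visible.

```shell
# %-10s  : string, left justified in a 10-character field
# %+06.1f: floating point, explicit sign, zero padded to 6 characters,
#          one digit after the decimal point
printf "%-10s|%+06.1f|\n" flux 3.14159
# prints: flux      |+003.1|
```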

   



Format codes

  Code    Interpretation
  ----    --------------
  c       single character
  s       string
  d, i    signed integer
  o       integer written as unsigned octal
  x, X    integer written as unsigned hexadecimal, the latter using
          uppercase notation
  e, E    floating point in exponent form m.nnnnne±xx or m.nnnnnE±xx
          respectively
  f       floating point in mmm.nnnn format
  g       uses whichever format of d, e, or f is shortest
  G       uses whichever format of d, E, or f is shortest





Flags

  Code    Purpose
  ----    -------
  -       left justify
  +       begin a signed number with a + or -
  blank   add a space before a signed number that does not begin with
          a + or -
  0       pad a number on the left with zeroes
  #       use a decimal point for the floating-point conversions, and
          do not remove trailing zeroes for g and G codes



If that’s computer gobbledygook, here are some examples to make it clearer. The result follows each printf command, unless it is assigned to a variable through the set mechanism. The commentary after the # is not part of the output, nor should it be entered to replicate the examples. Let us start with some integer values.

       set int = 1234
       set nint = -999
       printf "%8i\n" $int            # 8-character field, right justified
           1234
       printf "%-8d%d\n" $int $nint   # Two integers, the first left justified
       1234    -999
       printf "%+8i\n" $int
          +1234
       printf "%08i\n" $int
       00001234

Now we turn to some floating-point examples.

       set c = 299972.458
       printf "%f %g %e\n" $c $c $c           # The three main codes
       299972.458000 299972 2.999725e+05
       printf "%.2f %.2g %.2e\n" $c $c $c     # As before but set a precision
       299972.46 3e+05 3.00e+05
       printf "%12.2f %#.2G %+.2E\n" $c $c $c # Show the effect of some flags,
          299972.46 3.0E+05 +3.00E+05         # a width, and E and G codes

Finally we have some character examples of printf.

       set system = Wavelength
       set confid = 95
       set ndf = m31
       set acol = 12
       set dp = 2
  
       printf "Confidence limit %d%% in the %s system\n" $confid $system
       Confidence limit 95% in the Wavelength system  # Simple case, percentage sign
       printf "Confidence limit %.1f%% in the %.4s system\n" $confid $system
       Confidence limit 95.0% in the Wave system      # Truncates to four characters
       set heading = `printf "%10s: %s\n%10s: %s\n\n" "system" $system "file" $ndf`
       echo $heading                                  # Aligned output, saved to a
           system: Wavelength                         # variable
             file: m31
  
       printf "%*s: %s\n%*s: %.*f\n\n" $acol "system" \
              $system $acol "confidence" $dp $confid
             system: Wavelength                       # Aligned output with a variable
         confidence: 95.00                            # width and precision
  
       set heading = ""                               # Form a heading by appending to
       foreach k ( $keywords )                        # variable in a loop.  Note the
          set heading = $heading `printf "%8s  " $k`  # absence of \n.
       end
       echo `printf "%s\n" $heading`

Note that printf implementations differ. While you can check your system’s man pages to confirm that a desired feature is present, a quicker way is to experiment on the command line.
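
As an example of such an experiment, the sketch below checks one behaviour that is handy for tabular output: when more arguments are supplied than the format has conversion codes, POSIX-conforming versions of printf reuse the format for each further group of arguments. The names and values are invented; if your printf behaves differently, the output will show it at once.

```shell
# One format with two conversion codes, four arguments:
# the format is applied twice, giving two aligned rows.
printf "%-8s %6.2f\n" m31 3.5 m33 12.25
# prints:
# m31        3.50
# m33       12.25
```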