### 11 String Processing

The C-shell has no string-manipulation tools. Instead we mostly use the echo command and awk utility. The latter has its own language for text processing and you can write awk scripts to perform complex processing, normally from files. There are even books devoted to awk such as the succinctly titled sed & awk by Dale Dougherty (O’Reilly & Associates, 1990). Naturally enough a detailed description is far beyond the scope of this cookbook.

Other relevant tools include cut, paste, grep and sed. The last two and awk gain much of their power from regular expressions. A regular expression is a pattern of characters used to match the same characters in a search through text. The pattern can include special characters to refine the search. They include ones to anchor to the beginning or end of a line, select a specific number of characters, specify any characters and so on. If you are going to write lots of scripts which manipulate text, learning about regular expressions is time well spent.
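To give a flavour of these special characters, here is a small sketch that filters an invented three-line list of object names with grep. The names and patterns are made up for illustration. The commands are plain pipelines with single-quoted patterns, so they should behave the same whether typed in the C-shell or in a Bourne-type shell.

```shell
# ^ anchors the pattern to the start of a line: selects "NGC 2345".
printf 'NGC 2345\nIC 443\nAbell 3158\n' | grep '^NGC'

# $ anchors the pattern to the end of a line: selects "Abell 3158".
printf 'NGC 2345\nIC 443\nAbell 3158\n' | grep '58$'

# . matches any single character: selects "IC 443".
printf 'NGC 2345\nIC 443\nAbell 3158\n' | grep 'I. 4'

# {4} asks for a run of four characters from the preceding digit class,
# so the three-digit "IC 443" is rejected.  Braces need grep's -E option.
printf 'NGC 2345\nIC 443\nAbell 3158\n' | grep -E '[0-9]{4}'
```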

Here we present a few one-line recipes. They can be included inline, but for clarity the values derived via awk are assigned to variables using the set command. Note that these may not be the most concise way of achieving the desired results.

#### 11.1 String concatenation

To concatenate strings, you merely abut them.

set catal = "NGC "
set number = 2345
set object = "Processing $catal$number."

So object is assigned to "Processing NGC 2345.". Note that spaces must be enclosed in quotes. If you want to embed variable substitutions, these should be double quotes as above.

On occasions the variable name is ambiguous. Then you have to break up the string. Suppose you want text to be "File cde1 is not abc". You can either make two abutted strings or encase the variable name in braces ({ }), as shown below.

set name1 = abc
set name = cde
set text = "File $name""1 is not $name1"     # These two assignments
set text = "File ${name}1 is not $name1"     # are equivalent.

Here are some other examples of string concatenation.

echo 'CIRCLE ( '$centre[1]', '$centre[2]', 20 )' > $file1".ard"
gausmooth in=$root"$suffix" out=${root}_sm accept
linplot ../arc/near_arc"($lbnd":"$ubnd)"

#### 11.2 Obtain the length of a string

This requires either the wc command in an expression (see Section 10), or the awk function length. Here we determine the number of characters in variable object using both recipes.

set object = "Processing NGC 2345"
set nchar = `echo $object | awk '{print length($0)}'`      # = 19
@ nchar = `echo $object | wc -c` - 1

If the variable is an array, you can either obtain the length of the whole array or just an element. For the whole array, the number of characters is the length of a space-separated list of the elements. The double quotes are delimiters; they are not part of the string so are not counted.

set places = ( Jupiter "Eagle Nebula" "Gamma quadrant" )
set nchara = `echo $places | awk '{print length($0)}'`     # = 35
set nchar1 = `echo $places[1] | awk '{print length($0)}'`  # = 7
set nchar2 = `echo $places[2] | awk '{print length($0)}'`  # = 12

#### 11.3 Find the position of a substring

This requires the awk function index. This returns zero if the string could not be located. Note that comparisons are case sensitive.

set catal = "NGC "
set number = 2345
set object = "Processing $catal$number."
set cind = `echo $object | awk '{print index($0,"ngc")}'`        # = 0
set cind = `echo $object | awk '{print index($0,"NGC")}'`        # = 12

set places = ( Jupiter "Eagle Nebula" "Gamma quadrant" )
set cposa = `echo $places | awk '{print index($0,"ebu")}'`       # = 16
set cposn = `echo $places | awk '{print index($0,"alpha")}'`     # = 0
set cpos1 = `echo $places[1] | awk '{print index($0,"Owl")}'`    # = 0
set cpos2 = `echo $places[2] | awk '{print index($0,"ebu")}'`    # = 8
set cpos3 = `echo $places[3] | awk '{print index($0,"rant")}'`   # = 11

An array of strings is treated as a space-separated list of the elements. The double quotes are delimiters; they are not part of the string so are not counted.

#### 11.4 Extracting a substring

One method uses the awk function substr($s$,$c$,$n$). This returns the substring from string $s$ starting from character position $c$ up to a maximum length of $n$ characters. If $n$ is not supplied, the rest of the string from $c$ is returned. Let’s see it in action.

set caption = "Processing NGC 2345."
set object = `echo $caption | awk '{print substr($0,12,8)}'`     # = "NGC 2345"
set objec_ = `echo $caption | awk '{print substr($0,16)}'`       # = "2345."
set places = ( Jupiter "Eagle Nebula" "Gamma quadrant" )
set oba = `echo $places | awk '{print substr($0,28,4)}'`         # = "quad"
set ob1 = `echo $places[3] | awk '{print substr($0,7)}'`         # = "quadrant"

An array of strings is treated as a space-separated list of the elements. The double quotes are delimiters; they are not part of the string so are not counted.

Another method uses the UNIX cut command. It too can specify a range or ranges of characters. It can also extract fields separated by nominated characters. Here are some examples using the same values for the array places.

set cut1 = `echo $places | cut -d ' ' -f1,3`     # = "Jupiter Nebula"
set cut2 = `echo $places[3] | cut -d a -f2`      # = "mm"
set cut3 = `echo $places | cut -c3,11`           # = "pg"
set cut4 = `echo $places | cut -c3-11`           # = "piter Eag"
set cut5 = `cut -d ' ' -f1,3-5 table.dat`        # Extracts fields 1,3,4,5
                                                 # from file table.dat

The -d qualifier specifies the delimiter between associated data (otherwise called fields). Note that the space delimiter must be quoted. The -f qualifier selects the fields. You can also select character columns with the -c qualifier. Both -c and -f can comprise a comma-separated list of individual values and/or ranges of values separated by a hyphen. As you might expect, cut can take its input from files too.

#### 11.5 Split a string into an array

The awk function split($s$,$a$,sep) splits a string $s$ into an awk array $a$ using the delimiter sep.

set time = 12:34:56
set hr = `echo $time | awk '{split($0,a,":"); print a[1]}'`      # = 12
set sec = `echo $time | awk '{split($0,a,":"); print a[3]}'`     # = 56

# = 12 34 56
set hms = `echo $time | awk '{split($0,a,":"); print a[1], a[2], a[3]}'`
set hms = `echo $time | awk '{split($0,a,":"); for (i=1; i<=3; i++) print a[i]}'`
set hms = `echo $time | awk 'BEGIN{FS=":"}{for (i=1; i<=NF; i++) print $i}'`

Variable hms is an array so hms[2] is 34. The last three statements are equivalent, but the last two are more convenient for longer arrays. In the second you can specify the start index and number of elements to print. If, however, the number of values can vary and you want all of them to become array elements, then use the final recipe; here you specify the field separator with awk’s FS built-in variable, and the number of values with the NF built-in variable.

#### 11.6 Changing case

Some implementations of awk offer functions to change case.

set text = "Eta-Aquarid shower"
set utext = `echo $text | awk '{print toupper($0)}'`   # = "ETA-AQUARID SHOWER"
set ltext = `echo $text | awk '{print tolower($0)}'`   # = "eta-aquarid shower"

#### 11.7 String substitution

Some implementations of awk offer substitution functions gsub($e$,$s$) and sub($e$,$s$).
The latter substitutes $s$ for the first match of the regular expression $e$ in our supplied text. The former replaces every occurrence.

set text = "Eta-Aquarid shower"
# = "Eta-Aquarid stream"
set text = `echo $text | awk '{sub("shower","stream"); print $0}'`
# = "Eta-Aquxid strex"
set text1 = `echo $text | awk '{gsub("a[a-z]","x"); print $0}'`
# = "Eta-Aquaritt stream"
set text2 = `echo $text | awk '{sub("a*d","tt"); print $0}'`

set name = "Abell 3158"
set catalogue = `echo $name | awk '{sub("[0-9]+",""); print $0}'`   # = Abell
set short = `echo $name | awk '{gsub("[b-z]",""); print $0}'`       # = "A 3158"

There is also sed.

set text = `echo $text | sed 's/shower/stream/'`

is equivalent to the first awk example above. Similarly you could replace all occurrences.

set text1 = `echo $text | sed 's/a[a-z]/x/g'`

is equivalent to the second example. The final g requests that the substitution is applied to all occurrences.

#### 11.8 Formatted Printing

A script may process and analyse many datasets, and the results of its calculations will often need presenting in tabular form or as some other aligned output, either for human readability or to be read by other software. The UNIX command printf permits formatted output. It is analogous to the C function of the same name. The syntax is

printf "<format string>" <space-separated argument list of variables>

The format string may contain text, conversion codes, and interpreted sequences. The conversion codes appear in the same order as the arguments they correspond to. A conversion code has the form

%[flag][width][.][precision]code

where the items in brackets are optional.

• The code determines how the output is converted for printing. The most commonly used appear in the Format codes table.

• The width is a positive integer giving the minimum field width. A value requiring more characters than the width is still written in full. A datum needing fewer characters than the width is right justified, unless the flag is -. * substitutes the next variable in the argument list, allowing the width to be programmable.

• The precision specifies the number of decimal places for floating point; for strings it sets the maximum number of characters to print. Again * substitutes the next variable in the argument list, whose value should be a positive integer.

• The flag provides additional format control. The main functions are listed in the Flags table.

The interpreted sequences include: \n for a new line, \" for a double quote, %% for a percentage sign, and \\ for a backslash.
Format codes

| Code | Interpretation |
|------|----------------|
| c | single character |
| s | string |
| d, i | signed integer |
| o | integer written as unsigned octal |
| x, X | integer written as unsigned hexadecimal, the latter using uppercase notation |
| e, E | floating point in exponent form m.nnnnne±xx or m.nnnnnE±xx respectively |
| f | floating point in mmm.nnnn format |
| g | uses whichever format of d, e, or f is shortest |
| G | uses whichever format of d, E, or f is shortest |

Flags

| Code | Purpose |
|------|---------|
| - | left justify |
| + | begin a signed number with a + or - |
| blank | add a space before a signed number that does not begin with a + or - |
| 0 | pad a number on the left with zeroes |
| # | use a decimal point for the floating-point conversions, and do not remove trailing zeroes for g and G codes |

If that’s computer gobbledygook here are some examples to make it clearer. The result follows each printf command, unless it is assigned to a variable through the set mechanism. The commentary after the # is neither part of the output nor should it be entered to replicate the examples. Let us start with some integer values.

set int = 1234
set nint = -999
printf "%8i\n" $int            # 8-character field, right justified
    1234
printf "%-8d%d\n" $int $nint   # Two integers, the first left justified
1234    -999
printf "%+8i\n" $int
   +1234
printf "%08i\n" $int
00001234
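The remaining integer codes from the table, o, x, and X, can be tried in the same way. This is a minimal sketch using the literal value 255, chosen only for illustration. The format strings are in single quotes so that the lines should work unchanged from either the C-shell or a Bourne-type shell.

```shell
# One value written as octal, lowercase hexadecimal, and uppercase hexadecimal.
printf '%o %x %X\n' 255 255 255

# The 0 flag and a width of 4 pad the hexadecimal form with leading zeroes.
printf '%04X\n' 255
```

The first command should print 377 ff FF and the second 00FF.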

Now we turn to some floating-point examples.
set c = 299972.458
printf "%f %g %e\n" $c $c $c               # The three main codes
299972.458000 299972 2.999725e+05
printf "%.2f %.2g %.2e\n" $c $c $c         # As before but set a precision
299972.46 3e+05 3.00e+05
printf "%12.2f %#.2G %+.2E\n" $c $c $c     # Show the effect of some flags,
   299972.46 3.0E+05 +3.00E+05             # a width, and E and G codes

Finally we have some character examples of printf.

set system = Wavelength
set confid = 95
set ndf = m31
set acol = 12
set dp = 2
printf "Confidence limit %d%% in the %s system\n" $confid $system
Confidence limit 95% in the Wavelength system     # Simple case, percentage sign
printf "Confidence limit %.1f%% in the %.4s system\n" $confid $system
Confidence limit 95.0% in the Wave system         # Truncates to four characters
set heading = `printf "%10s: %s\n%10s: %s\n\n" "system" $system "file" $ndf`
echo $heading                                  # Aligned output, saved to a
    system: Wavelength                         # variable
      file: m31

printf "%*s: %s\n%*s: %.*f\n\n" $acol "system" \
       $system $acol "confidence" 2 $confid
      system: Wavelength                       # Aligned output with a variable
  confidence: 95.00                            # width and precision

foreach k ( $keywords )                           # Append to the variable in a loop.
   set heading = $heading `printf "%8s  " $k`     # Note the absence of \n.
end
echo `printf "%s\n" $heading`
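The same conversion codes are understood by awk's own printf statement, which can be handy when a table is built from lines of text rather than from shell variables. The sketch below is only an illustration: the two data lines are invented, and the column widths are arbitrary choices.

```shell
# Left justify the name in 12 columns, then print the value in an
# 8-character field with two decimal places.
printf 'Jupiter 5.203\nSaturn 9.537\n' | awk '{printf "%-12s%8.2f\n", $1, $2}'
```

Unlike print, awk's printf does not add a newline of its own, hence the \n at the end of the format.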