KPG1_FRACx
Finds values corresponding to specified fractions of an array's ordered distribution
This routine finds the values at specified fractions of an array's ordered distribution,
such as percentiles. Thus, to find the upper-quartile value, the fraction would be 0.75. Because it uses a
histogram technique rather than sorting the whole array, for efficiency, the result may not be exactly
correct. However, the histogram has a large number of bins (10000), and this, combined with linear
interpolation between bins, reduces the error. The histogram extends between the
minimum and maximum data values.
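The histogram technique described above can be illustrated with a minimal Python sketch. This is not the routine's own code: the function name `histogram_fractions`, its arguments, and the treatment of values within a bin as uniformly spread are assumptions made for illustration only.

```python
def histogram_fractions(data, fractions, nbins=10000):
    """Estimate the values at the given fractions (0 to 1) of the ordered data.

    Hypothetical helper: a histogram spanning the minimum and maximum data
    values is accumulated, and each value is found by linear interpolation
    within the bin where the cumulative count reaches the target.
    """
    lo, hi = min(data), max(data)
    if hi == lo:
        return [float(lo)] * len(fractions)

    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in data:
        # The maximum value is placed in the last bin.
        counts[min(int((v - lo) / width), nbins - 1)] += 1

    total = len(data)
    results = []
    for f in fractions:
        target = f * total          # number of values at or below the answer
        cum = 0
        value = float(hi)           # fallback if the target is never reached
        for b, c in enumerate(counts):
            if c > 0 and cum + c >= target:
                # Assume the c values are spread evenly across the bin.
                value = lo + (b + (target - cum) / c) * width
                break
            cum += c
        results.append(value)
    return results
```

For example, `histogram_fractions(list(range(1001)), [0.75])` returns a value close to 750. The error is bounded by roughly one bin width, which is why a large number of bins helps.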
The routine also has an iterative method, whereby outliers, which compress the vast majority of data values into a few bins, are excluded from the histogram. Clipping occurs from both ends. A contiguous series of bins is removed from each end until the smallest or largest fraction is encountered. The points where the rejection of bins ends define new limits that encompass the vast majority of values. A new histogram is calculated using these revised limits. The excluded outliers are still counted in the evaluation of the fractions. The criterion for iteration may need tuning in the light of experience. At present the routine iterates when fewer than 4% of the bins are non-zero.
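The iterative clipping can be sketched in Python as follows. This is a loose illustration under stated assumptions, not the routine's implementation: the names `clipped_fractions`, `_histogram` and `_locate`, the `max_iter` safety limit, and the choice of bin edges as the new limits are all hypothetical. Only the 10000 bins and the 4% occupancy criterion come from the description.

```python
def _histogram(data, lo, hi, nbins):
    """Bin the values lying in [lo, hi]; also count the values below lo."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    below = 0
    for v in data:
        if v < lo:
            below += 1              # excluded low outliers are still counted
        elif v <= hi:
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts, below, width


def _locate(counts, below, target):
    """Return (bin, position in bin 0..1) where the cumulative count hits target."""
    cum = below
    last = 0
    for b, c in enumerate(counts):
        if c > 0:
            if cum + c >= target:
                return b, max(0.0, (target - cum) / c)
            last = b
        cum += c
    return last, 1.0                # target lies beyond the histogram


def clipped_fractions(data, fractions, nbins=10000, min_occupancy=0.04,
                      max_iter=20):
    """Histogram estimate of fractional values with iterative outlier clipping."""
    total = len(data)
    lo, hi = float(min(data)), float(max(data))
    if hi == lo:
        return [lo] * len(fractions)
    t_min, t_max = min(fractions) * total, max(fractions) * total

    counts, below, width = _histogram(data, lo, hi, nbins)
    for _ in range(max_iter):
        occupied = sum(1 for c in counts if c > 0) / nbins
        if occupied >= min_occupancy:
            break                   # enough bins are populated; stop clipping
        # New limits: edges of the bins holding the extreme requested fractions.
        b_lo, _pos = _locate(counts, below, t_min)
        b_hi, _pos = _locate(counts, below, t_max)
        new_lo, new_hi = lo + b_lo * width, lo + (b_hi + 1) * width
        if (new_lo, new_hi) == (lo, hi):
            break                   # no further progress is possible
        lo, hi = new_lo, new_hi
        counts, below, width = _histogram(data, lo, hi, nbins)

    results = []
    for f in fractions:
        b, pos = _locate(counts, below, f * total)
        results.append(lo + (b + pos) * width)
    return results
```

With 1000 values from 0 to 999 plus two outliers at -1e9 and 1e9, a single histogram over the full range would place every core value in one bin. The sketch narrows the limits over two iterations, and the estimated median then lies within about one unit of 499.5.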
The iteration can still fail to find accurate fractional values if the smallest and largest fractions are close to 0 or 1 and correspond to extreme outliers. The routine recognises this state, determines the value for each such outlier fraction separately, and then uses the next interior fraction as the limit. The routine then proceeds with the clipping described above.
There is a routine for each numerical data type: replace "x" in the routine name by B, D, I, R,
W, UB or UW as appropriate. The array supplied to the routine must have the data type
specified.
For integer types the number of bins does not exceed the data range. The number of bins is reduced as clipping occurs.
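As a small illustration of the integer-type rule, the bin count might be chosen as below. The function name is hypothetical, and the sketch assumes that "data range" means the number of distinct integer values between the current limits, which the description does not state explicitly.

```python
def bins_for_integer_range(lo, hi, max_bins=10000):
    """Number of histogram bins for integer data with limits lo..hi.

    Assumption: the bin count is capped at the number of distinct integer
    values in the range, so no bin is narrower than one data unit.
    """
    return max(1, min(max_bins, hi - lo + 1))
```

Under this assumption, byte data spanning 0 to 255 would use 256 bins rather than 10000, and the count would shrink further as clipping narrows the limits.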
The iterative algorithm is not especially efficient; for data with a very wide range the iterations can be numerous. A sigma-clipping approach to remove the outliers might be better.
The adjustment of the limiting fractions is done for each limit separately, thus involving a further pass through the array. At present it finds the more extreme outlier first by comparing the bin number of the limits with respect to the mean bin number.