22, 260-267 (1983)
NOTE Nonlinear Reduction of Data JOHN M. EINBU Norwegian
Received May l&1981;
Oslo 3, Norway
revised March 3,1982
The problem of data reduction (sifting) of planar point series is approached from an information theoretical angle. It is concluded that effective data reduction can be achieved only when there exists a complementary relationship between the data reduction algorithm and the subsequent interpolation algorithm. A nonlinear data-reduction algorithm is described and examined. Some computational results demonstrate good agreement between the predicted and the actual behavior of the algorithm.
The data reduction considered in this paper is applicable to planar point series sampled by table digitizers, stereo-plotters or similar electromechanical devices in time-interval mode. The point series are captured because they carry information about the curves from which they are sampled. It will be assumed that these curves are continuous and without cusps. Since sampling in time-interval mode is unintelligent, the information content of the individual points will vary over the point series. Therefore, it may sometimes be desirable, in order to reduce storage space and transmission and processing costs, to eliminate some of the points which carry the least information. The act of compressing data without losing too much information is called sifting. Two aspects of sifting are discussed in this paper. The first, which is dealt with in the next section, concerns the information-theoretical basis of a sifting strategy and relates, from an intuitive viewpoint, the information content of a point in a point series to that of its predecessor and successor points and to the interpolator which will ultimately synthesize a curve from the data. The second aspect, presented in the final section, concerns the practical realization of the sifting strategy by means of a sifting algorithm adapted to a typical nonlinear interpolator.
Information theory asserts that information is something that reduces the uncertainty of a phenomenon. In the area of sifting, the phenomenon in question is the shape of synthesized curves. The information content of a point in a point series is thus a measure of how much better the curve through the point series can be estimated with the given point than without it. If there is a big difference in the estimates, the point has a large information content and vice versa. In the extreme case, when the two estimates are identical, the point has no information. It should be noted that in this consideration there is no reference to the original curve: the 260 0734-189X/83
Copyright Q 1983 by Academic Press. Inc. All rights of reproduction in any form reserved.
information content, as it is to be understood here, is tied to relative differences and not to the absolute difference between the original curve and an estimate of it. Several observations can now be made. The first observation is that the information content of a point cannot be established in an absolute sense, but only relative to that of its neighbor points. If a neighbor point is deleted, the information content will normally change-most probably it will increase. This observation dictates that one must proceed with caution if one intends to sift points on the basis of their information content relative to their nearest neighbors and to a certain threshold value. Otherwise it may happen that long sequences of consecutive points are deleted because their relative information contents all lie below the threshold, with the result that the ultimate synthesized curve will follow a course far away from the deleted sequence. The avoidance of such unfortunate effects requires the introduction of “fixed” points, points that are not deleted under any circumstance. (In the spirit of the present approach, these points could be said to have an information value of infinity.) It further requires that the information content of the noncommitted points should be determined entirely on the basis of the current set of fixed points. The first and last points should always be fixed points. Also, if there are distinct critical points in the sense of Freeman [ 1] these may be included among the fixed points. Moreover, to control the maximum gap between points in the output series. it may be convenient, initially, to make every n th point, for some n > 1, a fixed point. In the sifting process new fixed points are added dynamically according to criteria which take into account the instantaneous information content of the points. A natural choice of the measure of the information content of a noncommitted point under consideration is the shortest distance between the point and the curve estimated on the basis of the current fixed points. However, the observations made in this section are also valid for other measures. Under some conditions it is desirable to let the sifting take place concurrently with the capturing of the data. One problem encountered in this connection is that at the stage where a decision has to be made about whether to retain or drop a point, not all the fixed points are available. The decision must therefore be based on an incomplete set of fixed points. How detrimental this is depends on how long the decision can be delayed. If one or two fixed points on both sides of the point that IS about to be evaluated are known, a quite accurate measure of its information content can probably be achieved; if only the fixed points on one side are known. then the information content may be poorly estimated. This leads to the second significant observation: sifting under data registration with no “look ahead” possibilities cannot be very efficient. Is there any obvious choice for the shape of the estimated curve? Consider the operational use of the sifted point series: its most significant role is to act as input to an interpolation algorithm (for short, an interpolator) which lays a continuous line through the points. The point series must therefore be prepared so that the interpolator can perform its task as well as possible. This implies that, ideally, the interpolator should be employed for curve estimation in the sifting algorithm (for short, the sifter). Consequently, the third observation in this section is that the sifting process should be complementary to the interpolation process. Specifically, a nonlinear interpolator requires a nonlinear sifter. As an illustration, consider the two point series ABCD and ABC’D in Fig. 1. ABD are the fixed points and C and C’, lying on the chord between B and D and on
JOHN M. EINBU
FIG. 1. Two point series each with three fiied points and one free point.
the circle through ABD, respectively, are the noncommitted points to be evaluated. Obviously, if the point series is to be subject to linear interpolation, point C will contain no information for the interpolator but C’ will; if the interpolator is nonlinear, laying circular arcs through the given points, the situation is reversed. In the first case C should be rejected and C’ retained; in the second case C should be retained and C’ rejected. An interpolator may sometimes be quite a sophisticated piece of software and therefore expensive to use as the estimator of a sifting algorithm. Also, the effort involved in obtaining an acceptable measure of the information content of a free point may be prohibitive. However, the decision to sift data implies a willingness to sacrifice some accuracy, and it would be inconsistent to utilize the interpolator as an estimate if simpler, nearly equivalent, curve synthesizing algorithms were available. It does not seem to be commonly recognized that, to be effective, a sifter must be geared to the behavior of the potential interpolator. This explains the ahnost complete lack of sifters tailor-made for nonlinear interpolators. The tendency is to select points for retention such that the polygonal path through the points satisfies a certain distance criterion (e.g., ) or a certain area criterion (e.g., ). Most existing sifters, therefore, belong to the class of linear sifters. As an exception, Page [4) is apparently aware of this state of affairs. He suggests that, as a final check of the candidate points for deletion after a linear filtering process, the interpolator curve should be passed through the retained points and that the ultimate decisions about the candidate points should be determined by their distances from this curve. This postprocess may secure valuable information that otherwise would have been lost, but it does not eliminate redundancies in the retained data. To end this section, some new terms that will be useful in a discussion of sifters will be introduced Suppose a criterion has been devised which can distinguish between acceptable and unacceptable output point series from a sifter (a point series being acceptable if it carries enough information to enable the subsequent interpolator to reproduce the original curve satisfactorily; otherwise it is unacceptable). Let P be the input and P, the output point series of the sifter. P, is thus a subset of P. The P, is then called “exhausted” if P, is acceptable but no proper subset of P, is acceptable. The P, is called “minimal” if P, is acceptable but no other acceptable subset of P with fewer points than P, exists. A point series may be exhausted without being minimal. A sifter that, for every input point series and a given interpolator, retains only a minimal subset is called optimal with respect to the given interpolator. Obviously it is difficult to establish whether a sifter is optimal or not.
3. A NONLINEAR
The estimator of a sifter complementary to a wide class of general-purpose nonlinear interpolators (including, for instance, the cubic spline interpolator) should generate curves which approximate the interpolator curves in the following sense: (i) The curve is tangentially continuous throughout. (ii) A segment joining two consecutive points has no inflection point if the segment emanates from the endpoints on the same side of the connecting chord; otherwise it has one inflection point. (iii) The tangent direction at a nonterminal point is determined chiefly by the position of the point relative to the two adjacent points, and does not deviate significantly from the direction of the tangent of a circle or a parabola through these three points. To be feasible and meaningful ments:
the sifter must also satisfy the following
(iv) The measure of the information content of a point must be close to or equal to the shortest Euclidean distance from the point to the estimator curve. (v) The computations involved in establishing the information content of a point must be inexpensive. It is the last requirement that disqualifies the cubic spline interpolator from being a suitable estimator. An estimator curve consists, by definition, of points with no information content; such a curve may therefore conveniently be referred to as a “zero curve.” The zero curve of the proposed nonlinear sifter will now be discussed. The Zero Curve of the Nonlinear Sifter
To consider the nonterminal case first, let the four points A, B, C, and D in Fig. 2a and b be consecutive fixed points and let P be an arbitrary point (not necessarily an input point) in the region between B and C. Further, in each of the two configurations in Fig. 2 are shown two circular arcs with radii rl and r2 passing through ABC and BCD, respectively. The radii are signed such that arcs that run counterclockwise when passing through points in input order have positive radii, and vice versa. Thus r2 in configuration b is the only radius in Fig. 2 with positive sign. The distances from P to the circles, corresponding to dr, and dr2, are also signed and such that Iri -+ dr,l = C,P, for i = 1,2. Finally, S, and sZ are the lengths of the lines BP and PC, respectively. The segment of the zero curve extending between B and C may be defined as the locus of all points P between B and C satisfying dr,s, + dr2s, = 0.
In spite of the simplicity of its analytical expression, the zero-curve segment does not lend itself easily to an exhaustive analytical treatment. However, the following
JOHN M. EINBU
FIG. 2. Auxiliary points and lines needed for defining a segment of the estimator curve.
characteristics of the segment are conspicuous: The segment lies entirely between or on the two auxiliary circles. The segment has a common tangent with the arc ABC in B and a common tangent with the arc BCD in C (this can be proved rigorously by means of elementary methods from calculus). The segment intersects the central axis C,C, halfway between the two circles (easily derived since in this case s, = sZ). Segments are tangentially continuous and have zero or one inflection point, depending on whether they are associated with configurations of type a and b on Fig. 2, respectively (this seems intuitively plausible, but has not been proved rigorously). The second observation above implies that the zero curve is also tangentially continuous at the join of two adjacent segments. The only restriction that the method poses on the choice of fixed points is that the endpoints must always be fixed points. If, at the initial state of the sifting process of an open curve, only the endpoints are fixed, then the zero curve should be identical to the connecting chord between the two points. With more than two fixed points, the zero-curve segments of the first and last interval of open curves should coincide with the circular arcs through the first three points and the last three points. From the description and the deduced characteristics of the zero curve it is seen that, with one possible reservation about the number of inflection points, the curve satisfies requirements (i)-(iii) stated above. The zero-curve segments for the two configurations of Fig. 2 are shown in Fig. 3.
FIG. 3. Zero curves and lines of constant information content.
The Measure of Information Content The method used to measure the information content of a noncommitted point represents a compromise between requirements (iv) and (v) in the Introduction of this section. It discriminates between points lying outside and points lying inside the area limited by a line through B and C, and a line through C and C, (see Fig. 2). The points outside the area are, by definition, given an infinite information content and are therefore always retained. Points within or on the border of this area are assigned information content according to the formula
From (1) it follows directly that A point on the zero curve carries no information, A point outside the zero curve has positive information, A point’s information content increases with increasing distance from the zero curve, A point on the line C,B (C,C) in the proximity of B (C) has an information content close to the distance from the point to B (C) (follows from the fact that in this case S, (sZ) is close to zero), A point on the line C,C, has an information content equal to the distance from the point to the point where the line intersects the zero line (follows from the fact that S, = s2). Now, let P on Fig. 2 be a point on the zero curve. For a point Q outside the zero curve, formula (1) yields I = I[(&,
i- Arl)(sZ + As,) + (dr, + Ar,)(s,
+ As, + s2 + As,) (2)
for suitable values of Ari and As,, i = 1,2. If Q is close to P, the approximations
JOHN M. EINBU
As, = As, = 0 and Ar, = Ar, can be made, reducing (2) to I = ((dr,s, + dr,s,)/(s,
+ s2) + Ar,l = lAr,I.
This implies that I is close to the shortest Euclidean distance from the point to the zero curve for points close to that curve. The purpose of the measure 1 is that it should be used to decide whether a point should be deleted or not. It is therefore sufficient that the measure should satisfy requirement (iv) of the Introduction for points with information content below a given threshold value, that is, in a zone close to the zero curve. For points outside this zone, it is enough to know that their information content is above the threshold value. Therefore, with the modest requirement for accuracy that characterizes sifting processes in general, it may be concluded that the information measure I given by (1) is sufficiently accurate for the nonlinear sifter under consideration. For the end intervals, the information measure can be equated to the shortest Euclidean distance to the circular arcs or to the connecting chord, whichever is appropriate. In Fig. 3 are drawn two lines of points with constant information content (isolines) for each configuration. These lines may be conceived as the boundary lines for a “tolerance” zone; points within this zone will not, at least not at the current stage (see below), be admitted to the set of fixed points. The Global Mode of Operation
The user is free to select an initial set of fixed points which must include the two terminal points. If the user refrains from doing this, the two terminal points are selected by the sifter as the initial set. After the set of initial fixed points has been established, new fixed points are added by the sifter one at a time. The tolerance zones are updated dynamically and the process stops when none of the input points lie outside the current tolerance zone. The noncommitted input points (for short, the “free” points) are inspected collectively in one interval at a time starting with the first interval. If one or more points in the current interval lie outside the tolerance zone, the one with the largest information content is added permanently to the fixed points. If no points in the interval are outside the tolerance zone, the next interval is considered (if there are no intervals left the process terminates). When a new point in the interval is added to the fixed points, the dynamic updating of the tolerance zones necessitates a backtracking to the preceding interval (if any), to check whether the free points in that interval are still within the tolerance zone. Conceivably, a further regress may be necessary before the process becomes progressive again. For concretization, the addition of a fiied point between the points C and D on Fig. 2 (but not on the arc BCD) will alter the shape of the zero curve and tolerance zone in the interval BC. For online sifting, the backtracking process may have to be terminated after a limited number (e.g., one) of steps due to time shortage. In general, this limitation would probably incur no risk of information loss. The method may be considered a generalization of variant 2 of the Douglas and Peucker method , in particular under the condition that only the terminal points are the initial fixed points. Evidence of this is the fact that only two new statements had to be inserted in the computer implementation of the nonlinear sifter described here to enable it to act optionally as the linear sifter of Douglas and Peuclcer.
FIG. 4. Experiment with two different sifter/interpolator combinations. (a) Original point series and reference curve. (b) Linear sifter/nonlinear interpolator. (c) Nonlinear sifter/nonlinear interpolator.
In an experiment conducted to demonstrate the validity of the theory expounded in this paper, a series of 18 points (shown in Fig. 4a) was used. It was an arbitrary point series, not a selected one. Any other nontrivial series would have yielded the same demonstrative effect. The continuous curve through the points was produced by a nonlinear interpolator which extends two circular arcs between each pair of consecutive points. This curve was used as a reference curve, and the goodness of a sifter with respect to the nonlinear interpolator was measured by how well the same interpolator estimated the reference curve on the basis of the retained points. The point series was sifted, in two separate experiments, by the linear sifter of Douglas and Peucker and by the nonlinear equivalent described in this paper. T’he threshold values of the sifters were manipulated in such a way that exactly 9 points were retained in both cases. The results are shown in Fig. 4b and c. The reference curve is dashed and the reproduced curves are solid. The results clearly verify what was predicted theoretically: that the points retained by the nonlinear sifter convey more information to the interpolator than the points retained by the linear sifter. If the interpolator were linear, the output from the linear sifter would have carried more information. REFERENCES I. H. Freeman, Shape description via the use of critical points, Pattern Recognition 10, 1978, 159- 166. 2. D. Douglas and T. Peucker, Algorithms for the reduction of the number of points required to represent a digitized line or its caricature, The Canadian Cartogrupher 10, 1973, 112- 122. 3. D. E. McClure, Computation of Approximately Optimal Compressed Representations of Discretired Plane Curves, Proceedings of the IEEE Computer Society Conference on Pattern Recognition and Image Processing, 175- 182,6-S June 1977. 4. G. Page, Filtering of Digitized Lineal Features, Proceedings of the Digital Terrain Models Symposium, 269-274, American Society of Photogrammetry, Virginia, 1978.