45, 346-356 (1989)
Multiscale Feature Extraction from Range Data B. PARVIN AND G. MEDIONI
Institute of Robotics and Intelligent Systems,Department of Electrical Engineeri’ng, Powel Hall, 233, University of Southern California, Los Angeles, California 90089-0272 Received February 17,1988; accepted October 7,1988 We present a method to extract meaningful features from range images which introduces some novel ideas: First, our primitives are a richer set than the usual one, as we not only extract high frequences events, which correspond to jump boundaries and sharp creases, but also low frequency events such as smooth ridges and ravines. These last features are generally discarded as they tend to hide in noise; however, they provide a coarse description of the shape, and we believe they may help in inferring volumetric descriptions. The method works at multiple scales to improve reliability, and we have designed a control strategy to automaticaNy select the most appropriate mask size. We present results on images of different levels of complexity from three different sensors. 0 1989 Academic Press, k 1. INTRODUCTION
We are interested in the description of scenes obtained through the use of a range finder, for the tasks of object recognition, inspection, and manipulation. The representation scheme, in order to be useful, needs to f&l1 some constraints. It should be rich, to allow discrimin ation of similar objects. This requirement is especially necessary for range images, as opposed to intensity images, since they only encode distance to the sensor and ignore rich cues such as texture, reflectance, and surface markings. It should be stable, so that small changes in the input have very small effects on the resulting description, and it should be computed using &real support, in order to deal with partially occluded objects. We believe that lines corresponding to discontinuities of the surface (points of inflection) or the local surface orientation (creases), together with ridge and ravine lines, fulfill the conditions enumerated above. We extract them by fitting a continuous function to a variable support of data points. The fit should be poor around sharp discontinuities and generate a zero-crossing of the smoothed derivative for ridges and ravines. It is believed that these features will ultimately aid in volumetric description such as generalized cones. Most of the previous research has concentrated on high frequency discontinuities, such as jump boundaries and sharp folds, for the task of object recognition. Detection of low frequency discontinuities requires large spatial support for their identification. However, they often describe local symmetry which can be used to infer object shape. Such a wide range of features demand a large number of scales for their detection and localization. Previous work on the subject, most notably [7,15], proceeded by explicitly computing all features at a number of scales and then establishing correspondences between features at various scales. The correspondence phase is a necessary step for veri&cation and localization. This is a very complex problem to which a satisftiory solution has not been offered, and previous implementations relied on heuristics to address it. Instead, our approach is to locally find the best scale to describe a g&en 346 0734-189X/89
0 1989 by Academic Press, Inc. All rights of reproduction in any form reserved.
feature, therefore sidestepping the correspondence problem. We differ from past research in that the control strategy automuticully selects an adequate scale (kernel) size for the localization and detection of surface features. This is done by either expanding or shrinking the kernel size as the conditions dictate. All features are extracted using first and second derivatives properties which are computed from the underlying continuous function. We briefly review the past work in the domain; then in Section 3 we present the details of our computational technique, and show results on several scenes of varying complexity obtained from different sensors. 2. PAST
The field of robotics vision, and especially range image analysis, has been very active recently, mostly due to the availability of devices to generate good quality data. A recent survey paper  provides a comprehensive coverage of the state of the art in the field. Approximating scenes by simple surfaces, such as triangles or spline patches, although well suited for graphics applications, does not address the problem of scene understanding. Extended Gaussian images, proposed by Horn and Ikeuchi [ll, 121 are elegant, but are global descriptions. Langridge  showed some preliminary results on the detecting of discontinuities in the first derivatives of surface properties. Nackman  proposes to extract peaks, pits, and saddle points of surfaces to make a graph out of these features. Besl and Jain  propose to extract a few “seed” regions by computing simple curvature properties, then to grow and merge them to obtain homogeneous adjacent patches. The results shown are nice, but the control structure of the algorithm is very complex. Haralick et al. [9,10,17] propose the “topographic primal sketch,” in which a continuous function is fit at each pixel to estimate the derivatives at that point. Each pixel is then labelled as ridge, peak, pit, ravine, slope, and so on. This representation is still dense, and the problem of support size (scale) is mentioned but not solved. Following ideas put forth by Witkin , Asada and Brady [l] proposed a multiscale approach to extract important features of planar contours. The curve is repeatedly smoothed with Gaussian masks, and features are tracked along this dimension. This work was extended to surfaces by Ponce and Brady  and by Fan, Medioni, and Nevatia [7,8]. Primitives correspond to various combinations of zero-crossings and extrema of the curvature properties, as the original image is being smoothed with Gaussian filters. Even though the authors show impressive results on a number of scenes, they leave open the questions regarding the automatic determination of the smoothing parameters and, instead, adjust them for each image individually. 3. DESCRIPTION
Our goal is to reliably extract fundamental descriptors of the surface such as edge points, creases, ridge, and ravine lines. For the method to be robust, we would like to have the following properties: the feature detector should be reliable and accurate. That is, it should not detect very weak features, should locate them well, and also give a single response for each feature. These constraints are similar to the ones given by Canny . The key to our feature detection scheme lies in the relationship between the original discrete data points and their approximation by a
continuous function. This fit is observed as we vary the size of the support to confirm or reject detection. At each pixel, we use a number of neighbors to compute a cubic polynomial approximation of the data. It is then possible to compute a number of properties involving differential calculus. Since we are working with a depth map, a surface patch should be computed at each pixel, but this involves a large matrix multiplication, which would become computationally prohibitive since we work with many filter sizes. Instead, we compute two l-dimensional polynomials in orthogonal directions. The results show this approximation to be legitimate. We have chosen to use a third-order polynomial to represent the underlying pixel activity in the image. The rational for the third-order is that the root of the second derivative can be computed. The fit can be obtained either by an orthogonal set or B-splines. An orthogonal set offers certain computational advantages in multiscale analysis, as the partial results can be carried from one scale into the next. However, the fit usually oscillates near jump boundaries, and introduces additional zero-crossings. On the other hand, B-splines attempt to minimize the potential energy and ensure a smooth fit everywhere. The orthogonal set used in our work is based on Legendre polynomials which are eigenfunctions of Legendre differential equation. These polynomials can operate on equally spaced data, and are defined over an arbitrary interval -m 5 XI +m as follows: P,(X)
Where g(X) is the pixel intensity at location X. Discretization may introduce implementation problems in evaluating the above integral. This problem is resolved by using Simpson’s rule which is exact for a cubic form. An alternative to the orthogonal set is the B-spline polynomial which is defined over a neighborhood of four pixels by:
Larger operators can be constructed by convolving the above operator with a Gaussian mask iteratively. This implementation was proposed by Brady ef al. . Brady points out that an n times repeated convolution with this mask corresponds to filtering the data with a Gaussian with a approximate standard deviation of 6. In the rest of the paper, g, represents the intensity (range) at pixel i, & and 2 are the first and second derivatives of this function. 3.1. Zero-Crossings of the Second Derivative At any given scale, an edge is detected by the zero-crossing of the second derivative. To be more precise, at the center of a pixel, the second and first derivatives should be minimum and maximum, respectively. Therefore, one may choose to threshold zero-crossings when Ii/g] falls below a small threshold. The alternative is to threshold jump boundaries based on the magnitude of the gradient at the location of the zero-crossing, i.e., where S = 0. This is done by finding the root of the zero-crossing, and evaluating the gradient at that point. Furthermore, for an edge point to be valid, the root of the zero-crossing must fall within the pixel’s area. This condition aims at removing those zero-crossing points that are due to spurious noise. An additional piece of information that is readily available is the magnitude of the error in the cubic fit. The error is defined as the difference between the estimate and the true value of the depth map. It is natural to expect a poor fit at the jump boundary, causing a large error in the local estimate. This is a redundant criterion which is not used in our current implementation. Thus, we have the following conditions: 1. 121 > S which is a threshold, and 2. Jm. a,/5a,l I 0.5 which is the location of the zero-crossing root derived from Legendre polynomial with kernel size of 2m + 1. 3.2. Zero-Crossings
of the First Derivative
Ridges, ravines, and some types of creases create a zero-crossing derivative, and the choice between ridge or valley is decided by the second derivative. For an ideal peak, the gradient changes its sign from negative smoothly. Assuming that the correct kernel size of size 2m + used, the following constraints must be satisfied:
in the first sign of the positive to 1 has been
1. Ig, - &I < C: where E is the maximum allowable error in the fit defined locally at the location of the fit. This condition aims at classifying zero-crossing points only when the fit is good. A good example is the artificial peak generated by fitting a polynomial to a thin bar, thus, creating an artificial peak. The error in the fit can be specified if the signal to noise ratio is known. Otherwise, it is a smoothing parameter, and 2. the root of the zero-crossing should fall within the pixel’s area. Since two such roots exist, only the root closer to the center of pixel is considered. 3.3. Control Strategy for the Kernel Size
The kernel size or scale refers to the size of the neighborhood used for polynomial approximation, and it is a crucial parameter for feature extraction. If the kernel size is too small then there will be many zero-crossings, most of them due to noise. On
the other hand, if the kernel size is too large, the actual location of zero-crossings may be misplaced by several pixels. To sum up, small kernel size has a poor noise immunity while the large one has a poor localization property. Therefore, a strategy is needed to decide on an appropriate kernel size. This strategy is a two step process. First we apply a medium size kernel (band pass) along a direction. Whenever the error in the fit exceeds E, the filter size is reduced until either the error is acceptable or the minimum kernel size is reached. Second, we attempt to either localize or merge features. The localization is done by gradually applying smaller filters, and the merging is performed by applying larger filters. The basic idea is to obtain a fit with an appropriate kernel size which is smooth even after difirentiation. Several configurations arise: 1. Zero-crossing of the second derivative. In this case, the size of the kernel is gradually reduced and the location of the zero-crossing is tracked until either the location of the zero-crossing falls within the pixels area or the minimum kernel size is reached. At this point, if the condition of the jump boundary is satisfied, a label is assigned. It is possible to have two categories of edge label based on the sign of gradient (positive gradient indicates entering an edge, and negative gradient indicates leaving an edge). For some objects, such a detailed description allows pairing line segments at a higher level of interpretation. 2. Zero-crossing of the first derivative. A zero-crossing in gradient may indicate either a ridge or a valley, and the selection of the kernel size is based on the following simple rule: The ideal kernal size is the one that is minimum in size and produces u maxima of the curvature at the location of zero-crossing. Two possibilities exist: (a) Only one zero-crossing occurs along the length of the kernel. In this case, the kernel size may be increased or decreased depending upon whether the maximum curvature exists: i. If a maximum in the curvature exists, then the kernel size is gradually decreased until either the maxima vanishes or the integrity of the fit is altered by introducing additional zero-crossings. ii. Otherwise, the kernel size is increased until either a maximum is achieved or the maximum kernel size is reached. (b) Multiple zero-crossings occur along the length of kernel. This condition occurs when the kernel size is too small to detect smooth transitions. As a result several zero-crossings in close proximity are generated. In this case, first, the discontinuity is traced with a larger kernel until either all the zero-crossings merge together or the maximum kernel size is reached. Second, if a single zero-crossing is detected, the size of the kernel is gradually increased until a maximum of curvature is observed. Localization is an important step for feature extraction. For example, a stair case input may create an artificial inflection point between two real edge points. One approach to resolving this side effect is by localizing inflection points with a smaller filter, which causes the in&&ion discontinuity to vanish. Also, features collected along orthogonal directions may not necessarily use the same scale, thus leading to
of the cup.
displaced and non-overlapping descriptors along each axis, and creating ethos near discontinuities. This problem is also cured by accurate localization. 3.4. Fast Implementation of the Control Strategy It is desirable to avoid computation of the polynomial coefficients at every scale. Toward this end, we show a quick way to compute the ak values based on the previous ones: Any least squares fit polynomial can be implemented in terms of convolutions. It is our intent to reduce the computational load when the kernel size is varied. This is done by a decomposing the polynomials and maintaining the unnormalized inner product sum. For example a2 can be written as: a 2 = -$j-;
dX - ;j-;
For each of the above expressions, the value of the integral is retained. In the subsequent scale, these values are updated by the residual components of the scale size. Such an approach significantly reduces the computational load for multiscale feature extraction. 4. RESULTS
In this section, the algorithm is demonstrated on manmade and complex natural objects. These are shown in Figs. 1 through 4. All of the images are 12%by-128, they have been normalized to eight bits from their original range and no smoothing is applied prior to the polynomial fit. Images 1 and 2 are from a time of flight laser developed by ERIM, courtesy of Professor R. Jain. The third image is a drill with a complex surface structure, courtesy of Dr. J. Ponce. Image 4 was obtained at USC using a commercial laser triangulation system. The same set of parameters are used for all images. These are: 1. initial kernel size = 11, 2. minimum kernel size = 5,
of the double
3. maximum kernel size = 37, 4. maximum allowable error in the fit = 3.0, and 5. minimum threshold for accepting jump boundaries = 5.0. In Figs. 1 and 2, the discontinuities are directly superimposed data. The first image represent a cup used in 12,3]. Note that:
over the dense
(i) the handle is correctly segmented from the main body, (ii) no false ridge is detected on the handle, and (iii) the ridge and ravine lines due to smooth variations of the main body are extracted. The second image (Fig. 2) is an industrial part. This image contains jump boundaries, sharp creases, and ridges at several scales. For sharp creases, the filter is decreased automatically; and for ridge lines, such as those on the main body of cylinder, the kernel size is increased. Note that in both images the ridge lines on the cylinder indicate a strong local symmetry which can be used for shape inference. Most researchers display range images by encoding depth in grey level, but, this produces images with poor dynamic range. In the rest of this paper, we present range images by borrowing a technique from computer graphics. We compute the normal to the surface in a small neighborhood. Assuming that the object is lambertian, the reflectance image is generated in which the intensity is inversely proportional to the angle between the light source and the surface normal. The third image is the dense map of a drill. Figures 3a, 3b, and 3c show its original intensity, the shaded image, and the result of feature extraction. A ridge location is marked with a “ + ,” a ravine with a “ - ,” and an edge location with “*.” This image has a very complex surface structure. Detection of edge points is achieved by small scales, while ridge and ravine by medium scales. One can verify that correct features are appropriately extracted. Note the saddle points where ridge and ravine lines intersect. The last image is the dense map of a tooth with surface structures at werent scales of complexity, shown in Fig. 4. The actual image of this object has no
of the drill.
image of the drill.
of the drill.
(Cl FIG. 4. image.
(a) Shad ed image of the tooth. (b) Perspective view of the tooth. (c) Features of the tooth
contrast to a human observer (and is not shown). However, the shaded image, (Fig. 4a), reveals major valleys along the horizontal and vertical directions. In addition, smooth transitions of surface orientation in the form of ridges are present in the perimeter of the image. Figure 4b shows the perspective view of the original data used for processing. The result is shown in Fig. 4c. This image is of particular interest since its surface structure is highly irregular. Other than a sharp valley that runs across the horizontal direction, most of its surface requires medium to large support. In spite of this, the surface discontinuities are correctly detected and labelled. 5. CONCLUSION
We have presented a method to extract a rich set of features from a range image, representing both high frequency (jump boundaries, edges, and sharp creases) and low frequency (ridges and ravines) events. Together, they can produce a coarse to fine description of the shape and may help in generating volumetric descriptions. The approach is novel in that the size of the smoothing masks is automatically adjusted, resulting in robust and accurate detection. There are a number of points which may be improved; in particular, we know that a basis of cubic B-splines will give better results and using 1 D approximations might limit the power of the method. Our next step will also be to link these pointwise descriptions into curves of interest and to repeat the multiscale approach on these curves, generating a hierarchical representation. This approach will not only provide a more complete form of representation but also will drastically simplify higher level tasks such as recognition. REFERENCES 1. H. Asada and M. Brady, The curvature primal sketch, IEEE Trans. Pattern Anal. Mach. Intell. 8, No. 1, 1986, 2-15. 2. P. Besl and R. Jam, Segmentation through symbolic surface description, in IEEE Conf on Computer Vision
3. P. Besl and R. Jam, Three dimensional object recognition, J. Comput. Suroeys 17, No. 1, 1985, 75-145. 4. B. Boissonnat and 0. Faugeras, Triangulation of 3D objects, in Proceedings, 7th International Joint Conference
5. M. Brady, J. Ponce, A. Yuille, and H. Asada, Describing surfaces, J. Comput. Vision Graphics Image Process. 32, 1985, l-28. 6. J. Canny, A. computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8, No. 6, 1986,679-699. 7. T. J. Fan, G. Medioni, and R. Nevatia, Description of surfaces from range data using curvature properties, in IEEE Conf. on Computer Vision and Pattern Recognition, June 1986, 86-91. 8. T. J. Fan, G. Medioni, and R. Nevatia, Description of 3D surfaces using curvature, in Proceedings, DA RPA
9. R. Haralick, Ridges and valleys on digital images, Comput. Vision Graphics Image Process. 22, 1983, 28-38. 10. R. Haralick, L. Watson, and T. Ltiey, The topographic primal sketch, Int. J. Robotics Res. 2, No. 1, 1983, 50-71. 11. B. Horn, Extended Gaussian images, Proc. IEEE 72, 1984, 1656-1678. 12. K. Ikeuchi, B. Horn, et al., Picking up an object from a pile of objects, in First International Symposium on Robotics Research, 139-162, MIT Press, Cambridge, MA, 1984.
13. D. Langridge, Detection of discontintities in the first derivatives of surfaces, Comput. Vision Gruphicv Image Process. 27,1984,38-43. 14. L. Nackman, Two-dimensional critical configuration graphs, IEEE Trans. Pattern Anal. Mach. Intell. 6, 1984, 442-449. 15. J. Ponce and M. Brady, Toward a surface primal sketch, in IEEE Int. Con& on Robotxs awd Automation, March 1985, 420-425. 16. A. Witkin, Scale-space filtering, in Proceedings Seventh NCAI, Aug. 1983, 1019-1022. 17. L. Watson, T. LaKey, and R. HaraIick, Topographic classification of digital image intensity surface using generalized splines and discrete cosine transform. Comput. Vision Graphics Image Process., 29,1985, 143-167.