Efficient indexing of high-dimensional data through dimensionality reduction

Efficient indexing of high-dimensional data through dimensionality reduction

Data & Knowledge Engineering 32 (2000) 115±130 www.elsevier.com/locate/datak Ecient indexing of high-dimensional data through dimensionality reduct...

586KB Sizes 0 Downloads 18 Views

Data & Knowledge Engineering 32 (2000) 115±130

www.elsevier.com/locate/datak

Ecient indexing of high-dimensional data through dimensionality reduction Cheng Hian Goh 1, Agnes Lim, Beng Chin Ooi, Kian-Lee Tan * Department of Computer Science, National University of Singapore, Lower Kent Ridge 119260, Singapore Received 4 November 1998; received in revised form 21 July 1999; accepted 23 July 1999

Abstract The performance of the R-tree indexing method is known to deteriorate rapidly when the dimensionality of data increases. In this paper, we present a technique for dimensionality reduction by grouping d distinct attributes into k disjoint clusters and mapping each cluster to a linear space. The resulting k-dimensional space (which may be much smaller than d) can then be indexed using an R-tree eciently. We present algorithms for decomposing a query region on the native d-dimensional space to corresponding query regions in the k-dimensional space, as well as search and update operations for the ``dimensionally-reduced'' R-tree. Experiments using real data sets for point, region, and OLAP queries were conducted. The results indicate that there is potential for signi®cant performance gains over a naive strategy in which an R-tree index is created on the native d-dimensional space. Ó 2000 Elsevier Science B.V. All rights reserved. Keywords: R-trees; Dimensionally-reduced R-trees; Hilbert curve; High dimensional space; Linear space

1. Motivation The need for ecient access to large collections of multidimensional data is becoming more important. This can be attributed to an increased number of database applications in which criteria for data retrieval are de®ned on multiple attributes of underlying information objects. For instance, this is the case for OLAP applications where large aggregate queries over a huge amount of multi-dimensional data are common. Another contributing factor stems from the observation that data in various multimedia databases can often be eciently clustered if they are reduced to feature vectors in some high-dimensional space. For instance, this approach has been used for providing indexed access based on shapes, colors, and textures of multimedia objects [6]. The R-tree [9] and its variants [1,16] are known to yield good performance compared to a plethora of competing multi-attribute indexing structures [12]. However, it has been observed that the performance of R-tree-based index structures deteriorates rapidly when the imensionality of *

Corresponding author. Tel.: +65-874-2862; fax: +65-779-4580. E-mail addresses: [email protected] (B.C. Ooi), [email protected] (K.-L. Tan) 1 Deceased on 1 April 1999. 0169-023X/00/$ - see front matter Ó 2000 Elsevier Science B.V. All rights reserved. PII: S 0 1 6 9 - 0 2 3 X ( 9 9 ) 0 0 0 3 1 - 2

116

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

data is high [2,3]. This is because overlap in the directory increases rapidly with growing dimensionality of data, requiring multiple subtrees to be searched for each query. Evidence of this e€ect is provided in Fig. 1(a), where the cumulative overlap (de®ned as the overlap between a query region and the bounding boxes corresponding to nodes encountered in the traversal of the R-tree) is plotted against the dimensionality of data. The query used to obtain the result has a 1% coverage of the entire data space. It is easily observed that this cumulative overlap increases exponentially with the dimensionality of the data indexed. Details of the data set used for this experiment are described in Section 4. Notice that the y-axis is reported in logarithmic-scale. This phenomenon can also be easily understood by considering the fan-out of internal nodes in an R-tree. Suppose we conform to the classic de®nition of an R-tree where all nodes are of a ®xed size B. The fan-out of an internal node is clearly bounded by $ % B ; Pd iˆ1 …2  si † P where si is the size of data elements corresponding to dimension i. (The expression diˆ1 …2  si † constitutes the storage needed to de®ne a minimal bounding region in a d-dimensional space.) Clearly, the fan-out of an R-tree is inversely proportional to the dimensionality of data objects (d). The smaller fan-out contributes not only to increased overlap between node entries but also the height of the corresponding R-tree. Fig. 1(b) provides yet another empirical validation of this ``dimensionality curse'' using a real data set: the average number of disk accesses for point queries is observed to increase exponentially as the dimension of underlying data objects increases. The problem which we have just described has been tackled in the literature using at least two approaches. The ®rst approach seeks to reduce the dimensionality of data in one of two ways: (1) by organizing the directory using only a small number of dimensions that are most representative of the data objects (as in the TV-tree [11]), or (2) by transforming data objects from a high-dimensional space into some lower dimensional space (as in the case of Fastmap [7]). The TV-tree allows the size of feature vectors to be varied in di€erent levels of the index structure: this enables

Fig. 1. The dimensionality curse.

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

117

a small number of features with good discriminatory power to be used for organizing the top-level directories, resulting in higher fanout and greater eciency. Fastmap, on the other hand, is primarily intended for mapping objects into points in a k-dimensional space so that distances between objects are well-preserved (without requiring a domain expert to provide a set of k feature-extraction functions). By allowing objects which are ``similar'' to be clustered together, this mapping algorithm provides the basis for the construction of e€ective indexes using existing multi-dimensional indexing methods. Both Fastmap and TV-tree are however primarily intended for supporting similarity-based retrieval. Speci®cally, the underlying index organization assumes that all feature values are known. For example, the TV-tree will be of little help if values for features used as discriminators in top-level nodes are not known. This suggests that they are not quite as useful for generic multi-attribute indexing, where the most discriminating attributes may not be part of the search criteria. The second approach, proposed in [2], is to reduce the amount of overlap in the nodes of an Rtree index structure by increasing the fan-out of selected nodes. Hence, instead of creating new branches which may have signi®cant overlapping minimal bounding rectangles (MBRs), an X-tree is allowed to have internal nodes which may be of variable size (in multiples of the usual block size). It is observed that under these circumstances, it is more economical to search the enlarged node (called a supernode) sequentially rather than breaking them up into distinct subtrees in an arti®cial manner. Notwithstanding, the size of a supernode cannot be enlarged inde®nitely, since any increase in node size contributes ultimately to additional page accesses and CPU cost, causing the index to degenerate into a semi-sequential ®le. Further, it remains to be determined if eciency gains reported for the X-tree can be sustained under di€erent data distributions and for di€erent data volumes. In this paper, we propose a third approach whereby dimensionality of data is reduced through the use of the Hilbert space-®lling curve. Given a database of objects with attributes A1 ; . . . ; Ad , these are mapped to a k-dimensional (k < d) space by performing a ``linear walk'' in the k-subspaces corresponding to k clusters of attributes which form a partition of A1 ; . . . ; Ad . This e€ectively reduces the dimensionality of the data set to be indexed. This allows the resulting data to be eciently indexed using a lower dimensional classic R-tree or its variants. Experimental studies are conducted and their results indicate that the technique, a ``dimensionally-reduced'' R-tree, is capable of providing signi®cant performance gains at high dimensions. The remainder of this paper is organized as follows. Following this introduction, we present details on the transformation that maps from a d- to a k-dimensional space. Next, Section 3 describes search and update algorithms. Section 4 presents experimental results in which the performance of the ``dimensionally-reduced'' R-tree is compared to classic R-tree indexes. Section 5 summarizes our contribution and describes future work. 2. Dimensionality reduction Instead of the signal processing techniques used in TV-trees and Fastmap, we make use of a space-®lling curve to achieve the goal of dimensionality reduction. Several examples of space®lling curves have been proposed in the literature: the z-ordering [14], Gray code [5], and the Hilbert curve [10]. Space-®lling curves provide the means for introducing a locality-preserving total ordering on points in a multi-dimensional space. This peculiar property has been exploited in

118

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

many indexing structures and is widely recognized as a general technique for reducing a multiattribute indexing problem to a classical single-attribute indexing problem. The choice of particular space-®lling curves for reducing the dimensionality of data a€ects only how data points in a d-dimensional space are mapped to corresponding coordinates in a k-dimensional space. As we will see later, this determines how a query region is decomposed (into query regions of lower dimensionality) but not fundamentally how search and update operations are carried out. In this paper, we will adopt the Hilbert space-®lling curve for carrying out this transformation, since it has been reported that mapping based on the Hilbert curve outperforms the others under most circumstances [10]. 2.1. The Hilbert space-®lling curve Following [13], we assume that the multi-dimensional space can be partitioned into a ®nite set of discrete cells. The Hilbert curve can be constructed in a recursive manner. Fig. 2 provides an illustration of this construction via a 2-dimensional example. We begin with the basic curve residing on the unit region, as shown in Fig. 2(a): this is the Hilbert curve of order 1. To construct the curve of order 2 as in Fig. 2(b), we divide the region into four quadrants, and the same basic curve is placed in each of these quadrants with appropriate rotation and order of traversal (sense). This is repeated again to obtain the ®nal curve shown in Fig. 2(c). By repeating this process, we are able to construct a Hilbert curve which visits every cell (of any prede®ned granularity) exactly once. The order in which cells are visited by a Hilbert curve are assigned as their corresponding Hvalues, thereby de®ning a total ordering on the cells. A general algorithm for the assignment of Hvalues can be found in [8] and will not be repeated here. 2.2. Transformation based on the Hilbert curve The key insight underlying the proposed indexing technique lies in the observation that signi®cant improvement in performance can be achieved by reducing the dimensionality of data which is being indexed. Previous approaches to exploiting space-®lling curves for multi-attribute indexing have focused on mapping to a one-dimensional space [10]. By e€ecting the mapping to a k-dimensional space (where 1 < k  d), we are able to make use of the R-tree indexing technique which is known to perform well for low dimensional data.

Fig. 2. Recursive construction of the Hilbert curve.

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

119

In most instances, a d-dimensional region will not map exactly into a single contiguous region in a lower-dimensional space. Fig. 3(a) illustrates this anomaly in the case where d ˆ 2. The shaded region corresponds to the MBR de®ned by …2; 0†±…3; 2†. However, the H-values corresponding to the two cells are 14 and 11, respectively. The interval ‰14; 11Š is not a valid MBR in the 1-dimensional space; instead, the original 2-dimensional region can be represented by three intervals ± ‰11; 11Š, ‰8; 8Š, and ‰12; 15Š ± as shown in Fig. 3(b). The decomposition described in the preceding paragraph is accomplished using the operation Decompose described below. Decompose accepts a range query (in the form of a convex query region on the native d-dimensional space) and a partition of attributes A1 ; . . . ; Ad into k clusters, which we denote by C1 ; . . . ; Ck . It returns a set of regions …Q† on the reduced k-dimensional space. Algorithm Decompose(Q; fC1 ; . . . ; Ck g):Q. 1. (Project the query region Q on each cluster Ci to yield the mi -dimensional region Qi , where mi is the cardinality of the set Ci .) 2. (Identify the hypercubes in each cluster.) for i ˆ 1 to k do Ri ˆ dom(Ai1 †      dom(Aij ) where j ˆ is the cardinality of Ci and dom(Ai ) refers to the domain of attribute Ai (assumed ®nite) Hi ˆ Decompose-R(Qi ; Ri ). 3. (Return the permutation of hypercubes obtained from each cluster.) Q ˆ fQ 2 H1      Hk g Algorithm Decompose-R(Q; R):H 1. (Termination condition.) If Q ˆ R then return H ˆ R

Fig. 3. Decomposition of a 2-dimensional region into 1-dimensional intervals.

120

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

2. (Divide the region into 2m hypercubes where m is the dimensionality of the input region R.)  ˆ Divide(R) R 3. (Call Decompose-R recursively.)  do for each Ri 2 R Project Q onto Ri to yield Qi Hi ˆDecompose-R…Qi ; Ri †  jRj H ˆ [iˆ1 Hi . Fig. 4 illustrates the Decompose algorithm in the case where d ˆ 2 and k ˆ 1, and where the input region is given by …2; 0†±…3; 2†. In this instance, we have only a single cluster C1 ˆ fA1 ; A2 g. The ®rst call to Decompose-R will have the parameters Q ˆ ‰…2; 0†; …3; 2†Š and R ˆ f…x; y†jx; y 2 0::3g. Since Q 6ˆ R, we divide the input region into four quadrants (Step 1), and call Decompose-R recursively. The lower-right-hand quadrant now satis®es the termination condition (Step 2) and is returned; the cells 8 and 11 will be returned in the next recursive call to Decompose-R. The ®nal answer is given by Q ˆ f8; 11; 12::15g. To illustrate the e€ects of having more than one cluster of attributes, consider the example presented in Fig. 5 where the native search space is 4-dimensional. Suppose we partition the four search attributes into two clusters. Fig. 5(b) shows the projection of the original query region along the two surfaces (xy and wz) and the corresponding H-values. The 1-dimensional intervals are combined to yield the 2-dimensional search space identi®ed in Fig. 5(c). 3. Search and update operations Using the strategy proposed in this paper, ecient clustering of data records with respect to a collection of d attributes is accomplished using standard R-tree indexing techniques, except that this is preceded by a reduction phase where data records are mapped to points in a k-dimensional space de®ned by the H-values of the respective clusters. Data records in an R-tree are stored in the leaf nodes of this tree, and consist of a list of entries of the form …a1 ; . . . ; ak ; q†; where q refers to the key or identity or some data object, and ai are values of q corresponding to attribute Ai . Internal nodes of an R-tree consist of entries given by …MBRi ; Pi †; where MBRi de®nes a k-dimensional minimal bounding region on all data points present in the leaf nodes which are reachable by following the child-pointer Pi . Search operations on the ``dimensionally-reduced'' R-tree are somewhat di€erent from their conventional counterpart because it is no longer possible to search the d-dimensional space directly; instead, the d-dimensional subspace de®ned in a range query must now be transformed to a collection of k-dimensional search regions. Algorithm RangeSearch…Q; P †:L. 1. (Initialization) R ˆ dom…A1 †  dom…A2 †      dom…Ad †.

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

Fig. 4. An illustration of the Decompose algorithm for d ˆ 2 and k ˆ 1.

121

122

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

Fig. 5. An illustration of the Decompose algorithm for the case where d ˆ 4 and k ˆ 2.

2. (Decompose the range query Q into a list of queries on sub-regions.) Q ˆDecompose-R…Q; R†. 3. (Invoke Search recursively.) L ˆ Search…Q; P †. The Search operation shown below is a generalized form of its counterpart for regular ``native'' R-trees: instead of searching only a single region, the Search operation accepts as input a disjoint set of search regions. When traversing a dimensionally-reduced R-tree, a subtree will be explored

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

123

only if it encompasses a region which intersects with at least one region in the input collection of search regions. This guarantees that each node in the tree will only be visited once, although it remains necessary for multiple paths to be traversed. This algorithm is formally presented below. Algorithm Search…Q; P †:L. 1. (Termination condition.) If P points to a leaf-node N, then Lˆ/ foreach entry …a1 ; . . . ; ak ; q† 2 N do if …a1 ; . . . ; ak † 2 R then L ˆ L [ fqg 2. (P does not point to a leaf-node, then call Search recursively on all branches which may contain answers.) Lˆ/ foreach entry (MBRi , Pi ) in node referred to by P DQ ˆ fQ 2 QjQ \ MBRi 6ˆ /g if DQ 6ˆ / then DL ˆ Search(DQ; Pi ) L ˆ L [ DL end-if Insertion and deletion in the dimensionally-reduced R-tree is similar to classical counterparts, with the exception that these operations must now be preceded with a call to Decompose so that they are mapped to the corresponding (lower) dimensional space. As before, insertions may result in node splits. The choice of node-splitting strategies are, however, orthogonal to the issues discussed in this paper. 4. Experimental results To verify the practicality of dimensionality reduction as proposed in this paper, we performed an experimental study of the eciency of dimensionally-reduced R-trees compared to native-space R-trees. In [3], it is shown that although the R-tree is not as ecient as the R -tree at low dimensionality, its query performance outperforms the R -tree as the number of dimensions increases. We therefore concede that the R-tree is a good representative index for the R-tree family in the context of this study. Indexes studied in our experiments are implemented using a block size of 4 Kbyte. The quadratic cost splitting algorithm from [9] is used in node splitting. The experiments are conducted on a Sun SPARC Ultra-2 workstation running the SunOS 5.5. Several parameters used in our experiments, together with their default settings, are shown in Table 1. The database used in the experimental study consists of Fourier points corresponding to contours of industrial parts. This is the same data set used in the X-tree experiments reported in [2]. We extracted the ®rst 1 million records from this data set for the purpose of our experiments, and pruned the number of attributes to 2, 4 and 8. We also converted the data from real numbers to integers so that all attributes range from 1 to 100; 000; 000.

124

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

Table 1 Parameters and their values Parameter

Default values

System parameters Page size Index node size Bu€er size

4K page 4K page 128 pages

Database parameters Number of tuples Number of dimensions Domain of dimensions

1 million 4, 8 [1..100000000]

Query parameters Query type Number of queries per query type

Point 1000

Variations

2

Range ± 0.00001±10%

The data distribution is skewed. Fig. 6 shows the data distribution of the ®rst four attributes. It is obvious that the data follows the Zipf distribution. Amongst these four attributes, attributes 2 and 4 are more skewed than the other two. As shown later in Fig. 10, these may have an impact on the query eciency of these indexes. To better approximate the conditions under which the indexes will be used, we also simulated the e€ects of caching in our experiments. All indexes are implemented to take advantage of a priority-based replacement strategy [4]. This replacement strategy uses a Least-Useful (LUF) policy which exploits the retrieval pattern of index pages; speci®cally, priority is assigned to the pages depending on the ``level'' a page resides in. A page is considered to be useful if it will be referenced again in the near future (in the traversal of the index structure); otherwise, it is considered as useless. The LUF strategy therefore uses the following heuristics in determining which pages to replace:

Fig. 6. Data distribution of attributes 1, 2, 3 and 4.

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

125

· useless pages are considered for replacement before useful pages; · among the useless pages, those that reside nearer to the leaf nodes will be replaced ®rst (ties are resolved arbitrarily); and · among the useful pages, those that reside in the upper levels of the index will be replaced ®rst before those which are in lower levels (i.e. those near the leaf). This is based on the observation that in depth-®rst search, the pages in the lower levels will be accessed again before those nearer to the root. The following types of queries are used: 1. Point queries consisting of 1000 points that are randomly selected from the database. 2. Region queries are obtained by extending the point queries such that the coordinate in each dimension is enlarged to cover a range of 10, 50, 100, 1000 (0.001% of range), 5000 (0.005%), 10000 (0.01%), 50000 (0.05%), 100,000 (0.1%), 500,000 (0.5%), 1,000,000 (1%), 5,000,000 (5%) and 10,000,000 (10%). The number in the bracket indicates the percentage of the length of the query with respect to the whole data space. 3. Slice queries are obtained by extending the coordinates of point queries along certain dimensions. 4.1. Experiment 1: Point queries Apart from the R-tree, we introduce two dimensionally-reduced R-trees: R-tree(2dr) and Rtree(4dr). The R-tree(2dr), an R-tree with 2-dimensional reduction, clusters the dimensions into groups of two, and for each group, the data values are mapped into single-dimensional values. The R-tree(2dr) therefore reduces the number of dimensions to half. As an example, an 8-dimensional R-tree(2dr) is implemented using a 4-dimensional classic R-tree. Similarly, the Rtree(4dr), an R-tree with 4-dimension reduction, clusters the dimensions into groups of four, and reduces the number of dimensions by four times. In this case, an 8-dimensional R-tree(4dr) is implemented using a 2-dimensional classic R-tree. Fig. 7 shows the average disk accesses required for a point query with respect to the dimensionality of the data. The result shows that the R-tree(4dr) is more ecient than the R-tree(2dr).

Fig. 7. Point query performance.

126

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

This is expected as the index tree of the R-tree(4dr) is implemented as a 2-dimensional R-tree, and therefore maintains the eciency of the classic R-tree. The higher degree of dimensionality reduction yields better results due to less overlap between bounding boxes in the internal nodes. Both dimensionally-reduced R-trees perform signi®cantly better than the R-tree and the performance gain increases as the number of dimensions increases. 4.2. Experiment 2: Region queries In this experiment, we use 4-dimensional and 8-dimensional databases. We vary the query size from 10 on each dimension to 10,000,000, which is 10% of the data space. Fig. 8(a) and (b) show the experimental results on region query search with respect to increasing query sizes. Due to the dramatic increase in search cost over the query size, logarithm scales are used for both X and Y axis. In general, the R-trees with dimensionality reduction perform much better than the classic Rtree. However, as the query size grows, performance gain of the R-trees(2dr and 4dr) over the classic R-tree decreases. When the query size is very large, the number of subtrees that have to be searched is relatively large regardless of which index is used, and therefore search paths and high degree of overlap of an R-tree becomes less of an issue. The converging costs for large query sizes are apparent from Figs. 8 and 9. Graphs in Fig. 9 show the experimental results on region queries over a query length of 1,000,000 and 5,000,000, respectively. With increasing query length, the query eciency of the Rtree with 2-dimensional reduction degrades faster than the one with 4-dimensional reduction. 4.3. Experiment 3: Slice queries Due to its dimensionality problems, the R-tree is not expected to perform well for OLAP applications even though it can be applied directly. Unlike specialized indexes such as bitmap indexes, its symmetric treatment of all dimensions in OLAP data space does not require additional space overhead of the hierarchical indexes and processing overhead of bit operations. It is therefore not surpising that recent attempts at using the R-tree are limited to data cubes that have dramatically reduced dimensionality [15].

Fig. 8. Region query performance.

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

127

Fig. 9. E€ect of query length.

In OLAP applications, slice queries that correspond to …k ÿ i†-dimensional hyperplanes (0 < i < k) are needed to compute GROUPBY operators on the data cube. To test if the dimensionally-reduced R-tree is ecient for such kind of queries, we conducted four experiments on a 4dimensional data space. For each test, we restrict the value of one attribute to a point, while allowing the others to take on range values. These queries correspond to (3-1)-dimensional hyperplanes cutting across the cube along one dimension. From Fig. 10, we observe that the performance of the R-tree generally depends on the data and query distribution of each individual attribute. This e€ect becomes more pronounced as the query length increases, but less so for the dimensionally-reduced R-trees. 4.4. Experiment 4: E€ects of attribute clustering The eciency of the dimensionally-reduced R-tree depends to a large extent on which attributes are clustered. The choice of di€erent attribute clusters may lead to vastly di€erent performance. The best performance is likely to be obtained when the attributes selected result in even distribution of the points in the dimensionally reduced data space. This clustering can therefore be performed by expert users or via the use of statistical techniques where correlation between attributes can be discovered. To illustrate the preceding comments, consider the ®rst three attributes of the data set that is used in the experimental study. Fig. 11 shows clustering e€ects between attributes 1 and 2, and between attributes 1 and 3. It can be easily observed that data points are more densely clustered for attributes 1 and 3. This suggests that many more points are likely to be mapped to the vincinity of one another. Fig. 12 summarizes the performance of two R-trees(2dr) with two di€erent clusters: Rtree(2dr, 12) for clustering attributes 1 with 2, and 3 with 4 together; R-tree(2dr, 13) for clustering attributes 1 with 3, and 2 with 4 together. The results show that R-tree(2dr, 12) performs slightly better than the R-tree(2dr, 13). This is consistent with our earlier observation that attributes that are densely clustered result in poorer performance. This is not surprising since attributes that are

128

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

Fig. 10. E€ect of slice queries.

Fig. 11. Data clusters.

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

129

Fig. 12. E€ect of clustering.

densely clustered result in more skewed distributions of dimensionally-reduced objects, thus increasing the overlap amongst bounding boxes of the R-tree. 5. Conclusion In this paper, we have presented a technique for dimensionality reduction that allows data in high-dimensional space to be reduced to a corresponding low-dimensional space which can be indexed more eciently using the classical R-tree family of indexes. Previous approaches to dimensionality reduction have focused exclusively on the use of signal transformation methods (e.g., see [11]) which are ideal for similarity-based retrieval of feature vectors but are unsuitable as a general indexing technique for multi-dimensional data. While the use of a space-®lling curve as a pre-processing technique for multi-dimensional indexing is not new [10], we appear to be the ®rst to have studied how this can be exploited along with the excellent indexing performance of R-trees in low dimensions. The underlying transformation has been formally described in this paper along with modi®cations to the search and update operations for a ``dimensionally-reduced'' R-tree. Experimental studies have indicated that the technique introduced here is capable of providing signi®cant performance gains at high-dimensions. We are extending our work in several directions. First, we are exploring how the same idea can be applied on other existing structures (i.e., integrate Hilbert curve with other hierarchical indexing structures). Second, we are studying the issue of optimality of clusters. We believe that the optimal cluster size is likely to be domain and data dependent, i.e., for some domain/dataset, a smaller cluster size is bene®cial while a larger cluster size is better for others. We plan to explore this further. References [1] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R -tree: an ecient and robust access method for points and rectangles, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, 1990, pp. 322±331. [2] S. Berchtold, D.A. Keim, H.-P. Kriegel, The X-tree: an index structure for high-dimensional data, in: Proceedings of the VLDB, 1996.

130

C.H. Goh et al. / Data & Knowledge Engineering 32 (2000) 115±130

[3] E. Bertino et al., Indexing Techniques for Advanced Database Systems, Kluwer Academic Press, Dordrecht, 1997. [4] C.Y. Chan, B.C. Ooi, H. Lu, Extensible bu€er management of indexes, in: Proceedings of the VLDB, 1992, pp. 444±454. [5] C. Faloutsos, Multiattribute hashing using gray codes, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1986, pp. 227±238. [6] C. Faloutsos, Searching Multimedia Databases by Content, Kluwer Academic Press, Dordrecht, 1996. [7] C. Faloutsos, K.-I. Lin, Fastmap: a fast algorithm for indexing data-mining and visualization of traditional and multimedia datasets, in: Proceedings of the ACM SIGMOD International Conference on Managementof Data, May 1995, pp. 163±174. [8] C. Faloutsos, S. Roseman, Fractals for secondary key retrieval, in: Proceedings of the PODS, March 1989, pp. 247±252. [9] A. Guttman, R-trees: A dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Boston, MA, 1984, pp. 47±57. [10] H.V. Jagadish, Linear clustering of objects with multiple attributes, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1990, pp. 332±342. [11] K.-I. Lin, H.V. Jagadish, C. Faloutsos, The TV-tree: an index structure for high-dimensional data, The VLDB Journal 3,4 October 1994, pp. 517±542. [12] D. Lomet, A review of recent work on multi-attribute access method, ACM SIGMOD Record 21, 3 September 1992, pp. 56±63. [13] B. Moon, H.V. Jagadish, C. Faloutsos, J.H. Saltz, Analysis of the Clustering Properties of Hilbert Space-Filling Curve, Technical Report, Maryland 1996. [14] J. Orenstein, Spatial query processing in an object-oriented database system, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1986, pp. 326±336. [15] N. Roussopoulos, Y. Kotidis, M. Roussopoulos, Cubetree: organization of and bulk incremental updates on the data cubes, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Tuscon Arizona, 1997, pp. 89±99. [16] T. Sellis, N. Roussopoulos, C. Faloutsos, The R+-tree: a dynamic index for multi-dimensional objects, in: Proceedings of the VLDB, Brighton, UK, 1987, pp. 507±518.

Cheng Hian Goh (1965±99) received his B.Sc (Hons) and M.Sc in computer science from the National University of Singapore in 1990 and 1992 respectively, and his Ph.D. in Information Technologies from the Massachusetts Institute of Technology in 1997. He joined the Department of Computer Science, National University of Singapore as an Assistant Professor in November 1996. Cheng Hian passed away on 1 April 1999. He contributed much to the research community. In particular, his works on the Context Interchange Mediator Project at MIT have been well received by the database community.

Agnes Lim completed her B.Sc (Hon) in computer science from the Department of Computer Science at the National University of Singapore in 1998. Agnes's interests include databases, transportation and logistics, and operation research. She is currently working at the Computer Division of Systems & Computer Organization.

Beng Chin Ooi received his B.Sc. and Ph.D. in computer science from Monash University, Australia, in 1985 and 1989 respectively. He was with the Institute of Systems Science from 1989 to 1991 before joining the Department of Computer Science, National Unviersity of Singapore. His research interests include database performance issues, multi-media databases and applications, high dimensional databases and GIS. He is the author of a monograph entitled ``Ecient Query Processing in Geographic Information Systems'' (Springer-Verlag, LCNS #471, 1990), and a co-author of a monograph entitled ``Indexing Techniques for Advanced Database Applications'' (Kluwer Academic Publishers, 1997). He has published over 50 conference/ journal papers and served as a PC member for a number of international conferences (including ACM SIGMOD, VLDB, EDBT) and serves as an editor for Geoinformatics, Int. Journal of GIS, ACM and IEEE Computer Society.

Society.

Kian-Lee Tan received his B.Sc (Hons) and Ph.D. in computer science from the National University of Singapore (NUS), in 1989 and 1994 respectively. He is currently an Assistant Professor in the Department of Computer Science, NUS. His major research interests include multimedia information retrieval, wireless information retrieval, query processing and optimization in multiprocessor and distributed systems, and database performance. Kian-Lee is a member of ACM and IEEE Computer