Dimensionality reduction of multidimensional temporal data through regression


Pattern Recognition Letters 25 (2004) 899–910 www.elsevier.com/locate/patrec

Lalitha Rangarajan, P. Nagabhushan

Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore 570 006, Karnataka, India

Received 28 October 2003; received in revised form 16 February 2004; available online 20 March 2004

Abstract

A new method for pattern classification of multidimensional temporal data/images is proposed. In temporal data/images, each feature of a sample/pixel is not just a single numerical value, but a set (vector) of real values. The proposed method transforms the pattern of change in the feature values over time into representative patterns, termed symbolic objects [Bock, Diday (Eds.), Analysis of Symbolic Data, Springer Verlag, 2000], which are obtained through regression lines. Since a regression line symbolizes a sequence of numerical values of a feature vector, the so-defined symbolic object accomplishes dimensionality reduction of the temporal data. A new distance measure is devised to measure the distances between the symbolic objects (fitted regression lines), and clustering is performed. The method is very versatile and is readily applicable to any multidimensional temporal image. The algorithm is tested on two different data sets.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Pattern classification; Temporal feature analysis; Symbolic data; Regression; Dimensionality reduction; Clustering; Data assimilation

1. Introduction

Measurements made in pattern recognition applications are inherently multidimensional in nature. The larger the number of features, the more severe are the problems of storage and analysis time (Lillesand and Keifer, 1992; Jolliffee,

This research is supported by Indian Space Research Organization’s project RESPOND 10/4/317, October 1999. * Corresponding author. E-mail addresses: [email protected] (L. Rangarajan), [email protected] (P. Nagabhushan).

1986). Hence much importance has been attributed to the process of dimensionality reduction or feature reduction (Jolliffee, 1986). The aim of dimensionality reduction is to describe the samples by a minimum number of features that are necessary for discrimination of the classes in the image. A variety of methods for dimensionality reduction have been proposed in the literature (Jolliffee, 1986; Linganagouda et al., 1994; Nagabhushan, 1988). Most of these belong to either sub-setting methods (Lillesand and Keifer, 1992; Jolliffee, 1986) or feature space transformation methods (Jolliffee, 1986; Srikantaprakash et al., 1997;

0167-8655/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2004.02.003


Linganagouda et al., 1994; Nagabhushan, 1988), like principal component analysis (PCA) (Jolliffee, 1986). Some methods are also a combination of these two (Jolliffee, 1986; Linganagouda et al., 1994; Nagabhushan, 1988). Dimensionality reduction by PCA (Jolliffee, 1986) is transformation followed by selection, whereas the 'triant space' and 'stereographic projection' methods in (Gowda and Nagabhushan, 1989; Nagabhushan, 1988) are transformations that result in reduction. In a typical image every feature of a pixel (a sample in a general PR problem) is just a single numerical value. In temporal images every feature of a pixel is a set of real values, because the same target is imaged at different time steps, with the same set of features measured in all time frames and in the same order. Thus the temporal observations further increase the dimension from f to f × t (f is the dimension of the image and t is the number of time frames). The problem of dimensionality reduction of such multidimensional temporal data deserves a dedicated research treatment, and this is one of the major objectives addressed in this paper. Temporal imaging/observations use time as an aid to classifying the image. Some methods for studying temporal images use thresholds on feature values for pattern classification, obtain classification maps of successive time steps, and use them all for a classification of the total data set (Srikantaprakash et al., 1997). Gettler-Summa and Pardoux (Bock and Diday, 2000; Gettler-Summa, 1999) have developed an interpretation of temporal data by the symbolic object approach: they performed factorial analysis on each time step, followed by clustering on temporal changes, and arrived at links between time-ordered symbolic objects. The proposed classification method is very different from this approach; our method looks at the data set through the third dimension, namely time.
We have made an attempt to summarize these t-dimensional temporal feature vectors of a sample using a symbolic representation in terms of regression curves. This results in data assimilation of the temporal feature values of a sample. Section 2 outlines the methodology. Details of fitting regression lines are described in Section 3. In Section 4 we discuss the computation of 'distances' between the symbolic objects (regression lines), which is the foundation for clustering the samples. Section 5 discusses clustering procedures. In Section 6 experiments, results and comparative studies can be found. We conclude the paper in Section 7, highlighting drawbacks and suggestions for improvements.

2. Outline of the method

One of the effective means of classifying temporal data is the multitemporal profile approach (Lillesand and Keifer, 1992). The use of change detection for classification is the purpose of temporal imaging. Samples that seem alike at a particular time step need not exhibit similarity through all time steps. For example, in a remotely sensed image, a water body and a paddy field full of water before sowing may be classified together; but this image, when combined with another image of the same area taken at a later time, will surely give us the true classes. Two samples can be called 'similar' only if they are similar over time. Hence we have developed a method to study the behavior of samples over time and use this for pattern classification. In short, our method detects the changes in feature values of the samples over time and uses these changes for classification. In temporal images every feature of a pixel is a set (vector) of real values, the cardinality/dimensionality of the set being the number of time steps t during which the image is observed. Each feature can be viewed as a quantitative multivalued symbolic variable, as termed in an advanced exploratory data analysis technique called Symbolic Data Analysis (Bock and Diday, 2000). For example, suppose we have three features and four time steps. A sample will be described by three sets, one for each feature, and the size of each set is 4 (the number of time steps). The distance between such (set type) symbolic objects can be measured using (i) the number of total elements in both sets (Cartesian join, ∪) and (ii) the number of common elements in both sets (Cartesian meet, ∩) (Bock and Diday, 2000; Ichino and Yaguchi, 1994). These distances between various features can then be combined with the generalized Minkowski distance (Bock and Diday, 2000; Ichino and


Yaguchi, 1994). The distance between features can also be measured using agreement (the ∩ of both sets) and disagreement (the ∩ of a set and the complement of the other set), as suggested by De Carvalho (1998) (Bock and Diday, 2000), again aggregating these component distances with the generalized Minkowski metric. Neither distance measure is suitable for our symbolic object, although our data is also multivalued quantitative. Our 'set' is sensitive to the arrangement of elements within the set, unlike the 'set' in (Bock and Diday, 2000; Ichino and Yaguchi, 1994). That is, the following two sets of jth feature values of samples x and y, namely {1.6, 2.4, 4.7, 10.8} and {4.7, 1.6, 2.4, 10.8}, are not the same. The first set means that the feature j values at time steps 1, 2, 3 and 4 of sample x are 1.6, 2.4, 4.7, 10.8; whereas the second set indicates that the feature j values at time steps 1, 2, 3 and 4 of sample y are 4.7, 1.6, 2.4, 10.8. Our features are not only sets but also vectors. Hence the symbolic representation should take care of not only the set contents but also the arrangement of the numbers. We have used 'regression curves' for describing each feature of a sample. This symbolic transformation can take care of both the contents of the set and the arrangement of the set elements. For the above example of feature j values of samples x and y, our symbolic objects are regression curves fitted for the points (1, 1.6), (2, 2.4), (3, 4.7), (4, 10.8) and for the points (1, 4.7), (2, 1.6), (3, 2.4), (4, 10.8). These regression curves are entirely different. We have used 'least square error' regression curves to symbolize the features of a sample over time. If the number of temporal frames is t, the t feature j values are reduced to a regression curve of much lower order. Hence we have reduced the dimension of the data set too. The s sample items, each having f features over t time steps (the size of the data is s × f × t), can be reduced to s × f regression curves.
The transformation results in the assimilation of f × t features into f regression curves. A good regression curve provides an apt description of the samples, but it is difficult to determine the nature of the regression curves in advance. Even if we do, the jth feature values of different samples could yield different types of regression curves. For instance, the t values of the jth feature of samples i and k may yield a quadratic and a cubic curve, respectively. This


amounts to keeping track of the type of each regression curve as well as its parameters. This results in too much bookkeeping, making the problem of classification too tedious; it may even contradict the very theme of the research, that is, dimensionality reduction and data assimilation. Just as an arbitrary curve can be approximated by piecewise linear segments (Leung and Yang, 1990; Dunham, 1986), a regression curve can be represented by piecewise linear regression line segments. The time axis of the ith sample, jth feature is sliced and regression lines identified in each slice. The number and length of the slices are kept uniform for all features and samples, to ease the problem of measuring distance between samples. In some time slice the set of feature values of a sample could be typically nonlinear, or some other slicing scheme might produce better line segments (having far less error between the actual and linearized data) for some samples. However, these issues are not very crucial for classification of samples (such details may be very important in data mining applications). Our objective is to find the best possible lines for the given sample in each time slice and get an equally good classification as with the original data. To minimize the error between the data and the fitted lines, we slice the time axis (if the number of temporal frames is plenty) and find regression lines in each slice. The length of the slice does not have a significant impact on classification when two classes are well separable at least at one point in time. A good slicing scheme is one that reduces the dimension reasonably. For example, if the length of the slice is 4, we achieve a reduction to half the original dimension, since the fitted line for these four feature values is characterized by two parameters. The straight lines fitted in each time slice are more structured when compared to individual feature values.
Also, it is interesting to note that individual feature values are variant, whereas a straight line is invariant (within a slice) for a sample. In short, s × f × t data items have been reduced to s × f × c regression lines, where c is the number of slices of the time axis. A new distance measure has been proposed to measure the distance between the sets of regression lines (Rangarajan and Nagabhushan, 2000).
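As a concrete check of the s × f × t → s × f × c reduction, the numbers of the temperature experiment of Section 6.1 (37 cities, 2 features, 12 months, 3 slices) work out as follows. This is a trivial sketch; the variable names are ours:

```python
# Dimension bookkeeping for the reduction described above, using the
# temperature experiment of Section 6.1 as an example: s = 37 cities,
# f = 2 features, t = 12 months, c = 3 time slices.
s, f, t, c = 37, 2, 12, 3

original = f * t       # feature values per sample before reduction
lines = f * c          # regression lines per sample
reduced = 2 * lines    # each line carries two parameters (intercept, slope)

print(f"{reduced}/{original}")   # prints "12/24", the 3-slice row of Table 2
```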


3. Method in detail


Let the data items of the ith sample, jth feature and kth time step be represented by d[i, j, k], where 1 ≤ i ≤ s, 1 ≤ j ≤ f, 1 ≤ k ≤ t. The ith sample, jth feature values {d[i, j, 1], d[i, j, 2], …, d[i, j, t]} are divided (cut) into c slices, each slice having t/c data items. In each time slice, taking T (the independent variable set) to be the set of time steps and D (the dependent variable set) to be the set of feature values at these time steps, a least square error regression line is found. For example, a least square error regression line is fitted for the points {(1, d[i, j, 1]), (2, d[i, j, 2]), …, (t/c, d[i, j, t/c])} in the first time slice. Similarly, in the second time slice a regression line is identified for the points {(t/c + 1, d[i, j, t/c + 1]), (t/c + 2, d[i, j, t/c + 2]), …, (2t/c, d[i, j, 2t/c])}, and so on. Thus the d[i, j, k] are transformed into c regression lines of the form d = a + bt, where d is the dependent variable of feature values and t is the independent variable of time. The above process of splitting t time steps into c slices and finding regression lines in all slices is repeated for all features and samples. The algorithm can be best understood with the example illustrated in Figs. 1 and 2, where for a sample the number of features is 2, the number of time steps is 6 and the number of time slices is 2.

Algorithm 1. Representation of data using regression lines

Input: d[i, j, k], 1 ≤ i ≤ s, 1 ≤ j ≤ f, 1 ≤ k ≤ t, and c, the number of slices of the time axis.
Output: The coefficients of the regression lines, a[i, j, p] (y intercept) and b[i, j, p] (slope), for 1 ≤ i ≤ s, 1 ≤ j ≤ f, 1 ≤ p ≤ c.

1. For all samples i = 1 to s
2.   For all features j = 1 to f
3.     For all slices p = 1 to c
4.       For the data sets T = {(p − 1)t/c + 1, (p − 1)t/c + 2, …, pt/c} and D = {d[i, j, (p − 1)t/c + 1], d[i, j, (p − 1)t/c + 2], …, d[i, j, pt/c]}, find the regression line d = a[i, j, p] + b[i, j, p]·t
5.       Return (a[i, j, p], b[i, j, p])
End 3; End 2; End 1
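Algorithm 1 can be sketched in a few lines of numpy; the function and array names below are ours, and t is assumed divisible by c:

```python
import numpy as np

def fit_regression_lines(d, c):
    """Algorithm 1 sketch: reduce d[i, j, k] (an s x f x t array of temporal
    data) to least-squares lines, one per time slice. Returns the intercepts
    a[i, j, p] and slopes b[i, j, p] for slices p = 1..c."""
    s, f, t = d.shape
    w = t // c                                       # slice length t/c
    a = np.empty((s, f, c))
    b = np.empty((s, f, c))
    for p in range(c):
        T = np.arange(p * w + 1, (p + 1) * w + 1)    # time steps in slice p+1
        D = d[:, :, p * w:(p + 1) * w]               # feature values in the slice
        # fit all s*f series of this slice in a single polyfit call
        slope, intercept = np.polyfit(T, D.reshape(-1, w).T, 1)
        a[:, :, p] = intercept.reshape(s, f)
        b[:, :, p] = slope.reshape(s, f)
    return a, b
```

For the example of Figs. 1 and 2 (f = 2, t = 6, c = 2), each feature series of six values is reduced to two fitted lines.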


Fig. 1. The sample with two features and six time steps.


Fig. 2. Algorithm 1 may find the following lines.

4. Computation of distance measure

Two samples m, n are similar when d[m, j, k] and d[n, j, k] are close (nearly the same) for all features (1 ≤ j ≤ f) and all time steps (1 ≤ k ≤ t). Let L[sam, fea, sl] and R[sam, fea, sl] be the points of intersection of the regression line of sample sam for feature fea in time slice sl with the ordinates at the left (beginning) and right (end) of slice sl. That is, L[sam, fea, sl]R[sam, fea, sl] is the regression line of sample sam for feature fea in time slice sl. It is clear that m, n are similar whenever all L's of samples m, n and all R's of samples m, n are close on the respective ordinates, that is, whenever the absolute distances between the L's and between the R's of the samples on the respective ordinates are small. Our distance measure is a simple dissimilarity measure, taken to be the maximum of the following distances: {|L[m, 1, 1] − L[n, 1, 1]|, |R[m, 1, 1] − R[n, 1, 1]|, |L[m, 1, 2] − L[n, 1, 2]|, |R[m, 1, 2] − R[n, 1, 2]|, …, |L[m, 1, c] − L[n, 1, c]|, |R[m, 1, c] − R[n, 1, c]|, …, |L[m, f, 1] − L[n, f, 1]|, |R[m, f, 1] − R[n, f, 1]|, …, |L[m, f, c] − L[n, f, c]|, |R[m, f, c] − R[n, f, c]|}. That is, the distance between samples m and n is dis[m, n] = Max{|L[m, j, p] − L[n, j, p]|, |R[m, j, p] − R[n, j, p]| : 1 ≤ j ≤ f, 1 ≤ p ≤ c}. This simple supremum distance is sufficient to measure the dissimilarity between the samples, since our symbolic data is just a set of lines. If two such sets of lines are hardly separated


at the left and right ends of all time slices, then in between they must also be close to each other. If two such sets of lines representing samples m and n are significantly separated at the left or right end of a time slice for a particular feature j, then in that time slice some of the feature j values of samples m and n must be far from each other. A small value of the above distance implies there is not much deviation between the respective sets of regression lines, and hence the samples themselves deviate little throughout and can be called similar. A large value of the above distance implies there exist a feature and a time slice p where the regression lines


are so different because those feature values are not close, and thus they are dissimilar. The distance between two samples is illustrated in Figs. 3 and 4. Fig. 5 shows regression lines of two samples that are more similar than the samples in Fig. 6.

Lemma 1. The distance measure so defined is a metric.

Proof. Let c, f be the number of time slices and features, and let x, y, z be three samples. Recall that the distance between x and y, denoted dis(x, y), is max{|L[x, j, p] − L[y, j, p]|, |R[x, j, p] − R[y, j, p]| : 1 ≤ j ≤ f and 1 ≤ p ≤ c}.

1. dis(x, y) ≥ 0. This is evident since all distances on the left and right ordinates of each time slice are absolute differences between the respective points of intersection. dis(x, y) = 0 iff x = y:

Fig. 3. Two features of samples m, n.

Fig. 4. Distance between samples m, n (dis(m, n)). dis(m, n) = max{|L[m, 1, 1] − L[n, 1, 1]|, |R[m, 1, 1] − R[n, 1, 1]|, |L[m, 1, 2] − L[n, 1, 2]|, |R[m, 1, 2] − R[n, 1, 2]|, |L[m, 2, 1] − L[n, 2, 1]|, |R[m, 2, 1] − R[n, 2, 1]|, |L[m, 2, 2] − L[n, 2, 2]|, |R[m, 2, 2] − R[n, 2, 2]|}.
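The endpoint values L and R, and the supremum distance built from them, can be sketched as follows (our function names; a and b are the per-slice intercepts and slopes, shaped samples × features × slices):

```python
import numpy as np

def endpoints(a, b, t, c):
    """Intersections of each fitted line d = a + b*time with the ordinates at
    the left and right ends of its time slice. a, b have shape (s, f, c); the
    returned L, R have the same shape."""
    w = t // c
    p = np.arange(c)
    left = p * w + 1              # first time step of slice p+1
    right = (p + 1) * w           # last time step of slice p+1
    return a + b * left, a + b * right

def dis(L, R, m, n):
    """Supremum distance of Section 4: the maximum over all features j and
    slices p of |L[m,j,p] - L[n,j,p]| and |R[m,j,p] - R[n,j,p]|."""
    return max(np.abs(L[m] - L[n]).max(), np.abs(R[m] - R[n]).max())
```

Two samples whose fitted lines nearly coincide at every slice boundary thus get a small distance, matching the discussion around Figs. 3-6.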


Fig. 5. Regression lines of samples that are ‘similar’. The distances of the regression lines at all the ordinates (at every left and right ends of the time slices, for every feature) are small.



Fig. 6. Regression lines of samples that are 'dissimilar'. Regression lines are far apart in time slice 2 of feature 1 and in time slice 1 of feature 2 (implying that the feature values of the samples in these time slices are not close).

If dis(x, y) = 0, then all corresponding regression lines in every time slice are one and the same. This is possible only if the data of x and y are identical, that is, x = y. Conversely, if x = y, then all regression lines are one and the same and hence dis(x, y) = 0.
2. dis(x, y) = dis(y, x) ∀x, y. This is true since |L[x, j, p] − L[y, j, p]| = |L[y, j, p] − L[x, j, p]| and |R[x, j, p] − R[y, j, p]| = |R[y, j, p] − R[x, j, p]| ∀j, p.
3. dis(x, y) ≤ dis(x, z) + dis(z, y). For every j and p, |L[x, j, p] − L[y, j, p]| ≤ |L[x, j, p] − L[z, j, p]| + |L[z, j, p] − L[y, j, p]| ≤ dis(x, z) + dis(z, y), and likewise for the R's; taking the maximum over all j and p gives the inequality. Fig. 7 illustrates this with lines of three samples x, y and various possible line shapes of z, labeled z1, z2 and z3, for a feature in a particular time slice. In this time slice lines z1 and z2 satisfy dis(x, y) < dis(x, z) + dis(z, y), whereas line z3 satisfies dis(x, y) = dis(x, z) + dis(z, y). This holds in all time slices. Hence dis(x, y) ≤ dis(x, z) + dis(z, y). □

Fig. 7. Regression lines of feature j values of samples x, y and possible shapes of z in a time slice. Both L and R of z1 are outside the L's and R's of x and y. L of z2 is outside the L's of x and y, whereas R of z2 is inside the R's of x and y. Both L and R of z3 are inside the L's and R's of x and y.

5. Clustering

We have done a partitional clustering which is a combination of the methods of Forgy and MacQueen (Anderberg, 1973). The selection of initial seed points, or initial cluster centres, is done as suggested by Ball and Hall (Anderberg, 1973) in the following steps. Ball and Hall's selection of initial seed points does not require a priori knowledge about the data, is simple, and has often been used in the literature.

Algorithm 2. Finding seed points (seed lines) of the data set.

Input: Data in the form of regression lines.
Output: All seed points (centroids) of clusters (the number of clusters is determined by the algorithm).

1. Take the overall mean of the data as the first seed point.
2. Select subsequent seed points by examining the data units in their input sequence. Accept any data unit which is at least at some specified distance, DIS, from all previously selected seed points.
3. Perform 2 until no more seed points can be found.

For our data set, which is a set of regression lines, the initial seed point is selected as the set of lines joining the average of the L[i, j, p] and the average of the R[i, j, p], where 1 ≤ i ≤ s, 1 ≤ j ≤ f, 1 ≤ p ≤ c. The mean is a set of lines joining the averages of the L values (on an ordinate) and of the R values (on the ordinate at the right end of the same time slice). In the simple



Fig. 8. The ‘centre lines’ for the cluster with 2 samples m and n.

example of Fig. 8, where s = 2, f = 2, t = 6, c = 2, we have illustrated the computation of the 'centre lines'. Suppose the centre is denoted by the vector C[i, j], where i is the centre (seed point) number and j is the component number. For performing step 2 of Algorithm 2 we need the distance between a centre and a sample (both are sets of lines). For example, when f = 2, t = 6, c = 2, the distance between centre 1 and sample 1 is defined as Max{|L[1, 1, 1] − C[1, 1]|, |R[1, 1, 1] − C[1, 2]|, |L[1, 1, 2] − C[1, 3]|, |R[1, 1, 2] − C[1, 4]|, |L[1, 2, 1] − C[1, 5]|, |R[1, 2, 1] − C[1, 6]|, |L[1, 2, 2] − C[1, 7]|, |R[1, 2, 2] − C[1, 8]|}. In general, the distance between centre m and sample n is Max{|L[n, j, p] − C[m, 2k − 1]|, |R[n, j, p] − C[m, 2k]| : 1 ≤ j ≤ f, 1 ≤ p ≤ c, 1 ≤ k ≤ c × f}. After selection of the initial seed points, the initial partition is generated using Forgy's method. We have selected Forgy's method for the initial partition, since a coarse partition will suffice to start with.

Algorithm 3. To find the initial partition.

Input: Data in the form of regression lines; all seed points of clusters.
Output: Initial clusters of the data set.

1. Allocate each data unit to the cluster with the nearest seed point. (The seed points remain fixed in this process.)
2. Compute new seed points as centroids of the clusters.

Step 2 of Algorithm 3 computes the centroids of clusters as follows. Let LR be the regression lines of the samples for the various time slices. The new centroid for a particular time slice is the line joining the average of the L's and the average of the R's of that slice.

For refining the initial partition into the final clusters, MacQueen's k-means method is used. The number of refinement passes is comparatively reduced by MacQueen's method of recomputing the seed points whenever there is a change in cluster assignment.

Algorithm 4. To find the final clusters.

Input: Initial clusters.
Output: Final clusters.

1. Start with the initial partition described in Algorithm 3 and the new centroids as seed points.
2. Determine the cluster of each data unit to be that with the nearest seed point. Recompute new seed points whenever data is added to or deleted from a cluster.
3. Perform 2 until no data unit changes its cluster membership.
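Since each sample is fully described by its 2·c·f endpoint values (its L's and R's), it can be flattened into a vector on which the centre of a cluster is the componentwise mean, as in Fig. 8. A compact sketch of Algorithms 2-4 under that representation follows; the function names are ours, and batch reassignment per pass is our simplification of MacQueen's per-unit update:

```python
import numpy as np

def chebyshev(u, v):
    # the paper's supremum distance, on flattened endpoint vectors
    return np.abs(u - v).max()

def seed_points(X, DIS):
    """Algorithm 2 (Ball and Hall): the overall mean is the first seed; any
    data unit at least DIS away from every seed chosen so far becomes a seed."""
    seeds = [X.mean(axis=0)]
    for x in X:
        if all(chebyshev(x, s) >= DIS for s in seeds):
            seeds.append(x)
    return np.array(seeds)

def cluster(X, DIS, max_iter=100):
    """Algorithms 3 and 4: Forgy-style initial assignment, then iterative
    refinement until no data unit changes its cluster membership.
    An empty cluster keeps its previous centre (our simplification)."""
    centres = seed_points(X, DIS)
    k_range = range(len(centres))
    labels = np.array([min(k_range, key=lambda k: chebyshev(x, centres[k]))
                       for x in X])
    for _ in range(max_iter):
        centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centres[k] for k in k_range])
        new = np.array([min(k_range, key=lambda k: chebyshev(x, centres[k]))
                        for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, centres
```

With a suitable DIS, two well-separated groups of endpoint vectors end up in two different clusters.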

6. Experiments and results

6.1. Temperature data

The averages of the daily minimum and maximum temperatures for each month of 37 cities all over the globe are taken as the data, with two features (maximum, minimum) and 12 time steps (Jan, Feb, …, Dec). A gist of the data is given in Table 1. This data can be found in any good engagement book (diary). This is a typical pattern recognition problem with many temporal frames. Several classes are present in the data, since the


Table 1
Average temperature (in Celsius) data of the experiment in Section 6.1

City             Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Amsterdam  Min    -4    -5     2     5     7    10    10    12    10     5     1    -1
           Max     4     3    12    15    17    20    20    23    20    15    10     4
Madras     Min    20    20    22    26    28    27    26    26    25    24    22    21
           Max    30    31    33    35    39    38    36    35    34    32    30    29
Zurich     Min   -11    -8    -7    -1     2     6    10     8     5     3     0   -11
           Max     9    15    18    21    27    30    31    25    23    22    19    15

data covers many zones (tropical, temperate, equatorial, frigid) on the globe with varied temperatures. Experiments were performed with different numbers of time slices (2, 3, 4) and with various values of DIS (the distance between seed points). In general all resulted in satisfactory classification. The number of clusters obtained is insensitive to the number of slices; in other words, for a given DIS the number of clusters is the same for various numbers of slices. In all cases six clusters (labeled 0-5 in Table 6) are found. The number of iterations needed for convergence is more or less the same. The average association of the clusters is higher if the slice length is larger. A summary of the results of the experiments with various lengths of time slices may be found in Table 2. Cities that are hot and humid, like Madras, Calcutta, Bombay, Kuala Lumpur, Colombo and Singapore, formed a cluster. Places that are very cold, such as Moscow, Munich and Stockholm, were grouped into a cluster. Cities like Frankfurt and Zurich, where the minimum is low and the maximum is high, are clustered

together. The clusters obtained for the various lengths of the time slices are given in Table 6. We performed a cluster validation by computing 'association' (a measure of the nearness of elements within a cluster) and 'disassociation' (a measure of the distance between clusters). Association of a cluster = C/S, where C and S are defined as follows. C = compactness of the cluster = 1/D, where D is the average distance of cluster elements from the cluster centre; that is, if d_k is the distance of the kth element from the cluster centre, then D = Σd_k/N, where N is the size of the cluster. S = standard deviation of cluster elements from the cluster centre = √(Σd_k²/N). Disassociation between clusters i and j = D_ij/((D_i + D_j)/2), where the D's are as given below: D_ij is the distance between the cluster centres, and D_i, D_j are the average distances of cluster elements from their respective centres (already defined). For a good cluster both association and disassociation should be high. We have measured the obtained clusters using a cluster tendency index (cti) = average association (of all clus-

Table 2
Results of clustering with varied lengths of time slices

Number of     New dimension/       Average       Average          Cti       Iterations for
time slices   original dimension   association   disassociation             convergence
2             8/24                 0.133         0.26             0.393     3
3             12/24                0.105         0.76             0.865     3
4             16/24                0.00952       0.06             0.06952   2


ters) + average disassociation (between all pairs of clusters). Table 2 gives a summary of the 'association' and 'disassociation' obtained for clusters with three different numbers of time slices. These measures are very much dependent on the transformation: for example, when the transformation brings the data closer together, the 'association' is higher on the transformed data than on the original data, and the 'disassociation' is lower. The differences in these measures are perhaps because of the transformation into regression lines of various sizes. However, Table 6 indicates that there is no alarming change in the clusters obtained. Table 3 summarizes these numbers for different methods of transformation of the data. High association and disassociation are an indication of a good cluster. Our method has a high association compared to the other two methods. Relative scaling of the 'association' and 'disassociation' indices has been purposefully done to project the relative strengths of the methods, and also possibly to overcome the


non-uniform spread of feature values in the resulting transformation space.

6.2. Corn soybean data

This is multispectral, multitemporal data (Nagabhushan, 1988) collected to discriminate corn and soybean fields, a typical remote sensing application. The data has been tested with various dimensionality reduction methods in (Nagabhushan, 1988). It is simulated Thematic Mapper (TM) data obtained using helicopters from sites in Webster county, Iowa, with the objective mainly of characterizing corn and soybean fields. The data is available at unequal intervals of time, namely June 11, June 29, July 16 and August 30 of 1979, in six TM bands. The time interval was suitably scaled. Incidentally, our method of representing data with regression lines also provides for interpolation or prediction of missing data. The data has six features and four time steps, and a portion of it is given in Table 4. The classes are previously known to be corn or soybean field. 31

Table 3
Comparison of various methods

Method                  Dimension   Association   Disassociation   Cti
Proposed                12*         0.815         0.25             1.065
Principal components    24          0.242         0.91             1.152
Raw data                24          0.351         0.477            0.828

* Actual number of features = 2 (number of features) × 12 (time steps) = 24. Number of time slices = 3 (4 time frames in each slice). Number of lines = 2 (number of features) × 3 (time slices) = 6. Reduced number of features = 6 (lines) × 2 (parameters per line) = 12. Number of clusters obtained in all cases = 6.

Table 4
A portion of the table of values of the experiment in Section 6.2

Vegetation   Date     Feature values
                      1     2     3     4      5      6
Corn         Jun 11   3.7   4.8   5.7   10.3   23.0   19.8
             Jun 29   2.1   3.4   2.8   17.3   13.5    6.4
             Jul 16   2.2   3.6   2.5   35.7   13.8    5.1
             Aug 30   2.2   3.7   3.1   32.5   12.6    4.3
Soybean      Jun 11   3.3   3.9   4.4    7.9   18.6   16.9
             Jun 29   2.2   3.5   3.1   15.7   14.7    8.2
             Jul 16   2.5   4.4   3.3   40.1   21.7   10.2
             Aug 30   2.2   4.5   3.2   43.5   19.0    7.0


Table 5
Comparison of various methods

Method                  Dimension   Actual number of items misclassified   Γ
Proposed                12*         4 out of 60                            0.8
Principal components    24          1 out of 60                            0.9
Raw data                24          13 out of 60                           0.4

* Actual number of features = 6 (number of features) × 4 (time frames) = 24. Number of time slices = 1 (all 4 time frames in a single slice). Number of lines = 6 (number of features) × 1 (number of time slices) = 6. Reduced number of features = 6 (lines) × 2 (parameters per line) = 12.

Table 6
Table of clusters obtained with various numbers of slices

Name of the city   Cluster label,    Cluster label,    Cluster label,
                   slices = 2        slices = 3        slices = 4
Amsterdam          2                 3                 3
Athens             0                 0                 0
Bahrain            1                 1                 1
Bombay             1                 1                 1
Cairo              0                 1                 1
Calcutta           1                 1                 1
Colombo            1                 1                 1
Copenhagen         3                 3                 3
Dubai              1                 1                 1
Frankfurt          2                 2                 2
Geneva             2                 3                 3
Hong Kong          0                 1                 0
Kuala Lumpur       1                 1                 1
Lisbon             0                 0                 0
London             2                 0                 0
Madras             1                 1                 1
Madrid             0                 0                 0
Manila             1                 1                 1
Mauritius          4                 4                 4
Mexico City        1                 1                 1
Moscow             3                 3                 3
Munich             3                 3                 3
Nairobi            4                 4                 4
New Delhi          1                 1                 1
New York           3                 0                 0
Paris              2                 0                 0
Rome               0                 0                 0
San Francisco      0                 0                 0
Seoul              2                 0                 2
Singapore          1                 1                 1
Stockholm          3                 3                 3
Sydney             4                 4                 4
Tehran             5                 5                 5
Tokyo              0                 0                 0
Toronto            3                 3                 3
Vienna             2                 3                 3
Zurich             2                 2                 2

Note: For all these experiments the distance between cluster centres is 15, and the number of clusters obtained is 6 (labeled 0-5). The cluster labels are local to a method.


samples are from the corn field and 29 from the soybean field. A single regression line per feature (six regression lines per sample) resulted in a very good classification; the misclassification is 6.6% (4 soybean samples out of 60 are identified as corn field). Unlike the experiment in Section 6.1, the classes of this data are known a priori. Hence we computed the Hubert Γ statistic (Jain and Dubes, 1988) using a contingency table for the obtained and known partitions. Table 5 gives a summary of the results obtained from various experiments on this data. Often in multidimensional data the features are redundant, and transformation and reduction of features often results in a good classification (Nagabhushan, 1988); the corn soybean data is no exception. A glance at the Γ values in Table 5 provides justification for transformation and reduction of features.

6.3. Computational complexity

Let n, f and t be the number of samples, features and time steps.

PCA: the number of features is f × t = g.

Step                                                      Complexity
1. Finding variance, covariance of features               O(n·g), O(n·g²)
2. Finding eigenvalues and eigenvectors of the
   feature matrix                                         O(g⁴)
3. Finding principal components                           O(n·g²)
4. Computation of distance between a pair of samples      O(g)
   (if dimensionality reduction is not done; observe
   that dimensionality reduction requires further steps)

Our method: The number of features is 2·c·f = h, where c is the number of time slices.

Step                                                      Complexity
1. Finding regression lines                               O(n·c·f)
2. Computation of distance between a pair of samples      O(h)


It is evident that the transformation to regression lines is far simpler than finding principal components, and the distance computation during clustering is less complex in our method because of the smaller number of features. Even in the worst case, when the length of a time slice is 3 (slices shorter than 3 do not require regression line identification), we reduce the dimension by one third.
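For concreteness, the transform whose cost is tabulated above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes equal-length time slices, uses ordinary least squares for each line, and uses plain Euclidean distance on the h = 2·c·f line parameters as a stand-in for the paper's own distance measure.

```python
import numpy as np

def to_regression_lines(samples, n_slices):
    """Reduce an (n, f, t) temporal array to (n, 2*c*f) line parameters:
    one (slope, intercept) pair per feature per time slice."""
    n, f, t = samples.shape
    slices = np.array_split(np.arange(t), n_slices)  # equal-length slices assumed
    out = np.empty((n, f, n_slices, 2))
    for k, idx in enumerate(slices):
        x = idx.astype(float)
        for i in range(n):
            for j in range(f):
                # least-squares fit y = slope*x + intercept over this slice
                out[i, j, k] = np.polyfit(x, samples[i, j, idx], 1)
    return out.reshape(n, -1)

# 6 features observed over 4 time frames, one slice: 24 raw values -> 12 parameters
data = np.random.rand(5, 6, 4)
reduced = to_regression_lines(data, n_slices=1)
assert reduced.shape == (5, 12)
# distance between a pair of samples is then O(h), with h = 2*c*f
d = np.linalg.norm(reduced[0] - reduced[1])
```

The nested loops make the O(n·c·f) cost of the transform explicit: one small fit per sample, per feature, per slice.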

7. Conclusion

The problem of representing multidimensional temporal values as symbolic objects called regression lines, and thereby reducing the dimension of the data, is addressed in a novel way. The temporal values of a feature are represented as a set of regression lines; feature values are thus no longer plain sequences of numerical values (difficult to interpret) but are transformed into shorter sets of units, namely regression lines (the temporal data is assimilated in the form of lines). Experiments with real data have demonstrated the ability of the method to produce natural classifications, which is the ultimate goal of multidimensional data analysis. The cluster validation indices obtained show that the proposed method is comparable to PCA, while the computational effort needed is far less.

Potential applications of the method include areas such as remote sensing, medicine, agriculture and computer vision, where multidimensional (spectral, in the case of remotely sensed data) observations are encountered. For example, in remotely sensed images the f × t spectral values of each pixel can be reduced to f × (t/c) regression lines (c being the length of the time slice), and clustering can be performed on this transformed set of lines.

The primary drawback of the method is that it forces a line upon possibly nonlinear data. To circumvent this problem, a method for finding the natural linear segments present in the feature values could be devised, and regression lines could then be fitted within these segments. If this is done, a natural classification is created among the samples as soon as the regression lines are fitted. The number of lines and the time slices should be identical if



two samples are to be called 'similar'. This creates an initial classification, which can be fine-tuned later. Another inherent problem of our symbolic representation is that when two classes differ only subtly, the regression lines tend to reduce the existing differences; the proposed method works well when the classes are separable by at least one feature.
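The suggested remedy can be sketched as a simple greedy segmentation: grow a segment while a single least-squares line still fits it, and start a new one when it no longer does. This is only an illustration of the idea; the residual threshold below is an arbitrary assumption, not a value from the paper.

```python
import numpy as np

def linear_segments(y, max_residual=0.05):
    """Split the series y into half-open index ranges [start, end) such that
    one least-squares line fits each range to within max_residual."""
    y = np.asarray(y, dtype=float)
    segments, start, end = [], 0, 2
    while end <= len(y):
        x = np.arange(start, end, dtype=float)
        coef = np.polyfit(x, y[start:end], 1)
        if np.max(np.abs(np.polyval(coef, x) - y[start:end])) > max_residual:
            segments.append((start, end - 1))  # close the segment before this point
            start, end = end - 1, end + 1
        else:
            end += 1
    segments.append((start, len(y)))
    return segments

# a rising then falling series splits into two near-linear segments
print(linear_segments([0, 1, 2, 3, 2, 1, 0]))  # -> [(0, 4), (4, 7)]
```

Regression lines fitted within the detected segments would then serve as the symbolic objects; as noted above, two samples would need matching numbers of lines and time slices before being compared.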

References

Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press, New York.
Bock, H.H., Diday, E. (Eds.), 2000. Analysis of Symbolic Data. Springer Verlag.
Dunham, J.G., 1986. Piecewise linear approximation of planar curves. IEEE Trans. PAMI 8.
Gettler-Summa, M., 1999. MGS in SODAS. Cahiers du CEREMADE no. 9935, Université Paris IX Dauphine, France.
Gowda, K.C., Nagabhushan, P., 1989. Stereographic projection based dimensionality reduction of remotely sensed multidimensional data. In: Internat. Seminar on Image Process., Singapore, November 1989.
Ichino, M., Yaguchi, H., 1994. Generalized Minkowski metrics for mixed feature type data analysis. IEEE Trans. Systems Man Cybernet. 24 (4).
Jain, A.K., Dubes, R.C., 1988. Algorithms for Clustering Data. Prentice Hall.
Jolliffee, I.T., 1986. Principal Component Analysis. Springer Verlag, New York.
Leung, M.K., Yang, Y.-H., 1990. Dynamic strip algorithm in curve fitting. Comput. Vision Graphics Image Process. 51.
Lillesand, T.M., Keifer, R.W., 1992. Remote Sensing and Image Interpretation. John Wiley & Sons, New York.
Linganagouda, K., Nagabhushan, P., Gowda, K.C., 1994. A new approach for feature transformation to Euclidean space useful in the analysis of multispectral data. In: IEEE IGARSS '94, 1994.
Nagabhushan, P., 1988. An efficient method of classifying remotely sensed data (incorporating dimensionality reduction). Ph.D. thesis, University of Mysore.
Rangarajan, L., Nagabhushan, P., 2000. A new method for pattern classification through dimensionality reduction based on regression analysis. In: Proc. Indian Conf. Computer Vision, Graphics Image Process., December 20–22, 2000.
Srikantaprakash, H.N., Nagabhushan, P., Gowda, K.C., 1997. Symbolic data analysis of multispectral temporal data. In: IEEE Internat. Geosci. Remote Sensing Symposium, Singapore, August 3–8, 1997.