
Pattern Recognition Letters 16 (1995) 219-223


Dimensionality reduction of symbolic data

P. Nagabhushan a,*, K. Chidananda Gowda a, Edwin Diday b

a Dept. of Computer Sc. & Engg., SJ College of Engg., Mysore 570 006, India
b INRIA, Rocquencourt, Domaine de Voluceau, BP 105, 78153 Le Chesnay Cedex, France

Received 1 June 1994; revised 23 August 1994

Abstract

Hitherto, dimensionality/feature reduction techniques have been studied with reference to conventional data, where the objects are represented by numerical vectors. This proposal extends the notion of dimensionality reduction to more generalised objects called symbolic data. A mathematical model is presented which generates symbolic features, particularly of span type, in a transformed lower-dimensional space from a high n-dimensional feature space of span-type symbolic data. This work is expected to open a new avenue in the area of symbolic data analysis.

Keywords: Dimensionality/feature reduction; Symbolic data; Symbolic "span" feature; Feature extremals; Worst case span; Sensitivity index

1. Introduction

In conventional data analysis, the objects are numerical vectors. The "length" of such vectors depends upon the number of features recorded, leading to the creation of a multi-dimensional feature space. The larger the dimensionality, the more severe the problems of storage and analysis. Hence much importance has been attached to the process of dimensionality or feature reduction, which is achieved by subsetting or transformation methods (Jain and Dubes, 1988; Nagabhushan, 1988; Nagabhushan and Sudhanva, 1989; Sudhanva and Gowda, 1992). It is beyond the scope of this paper to describe the importance and methodologies of the various dimensionality reduction procedures; details can be found in (Jain and Dubes, 1988; Nagabhushan, 1988). The issue taken up here is to extend the notion of feature/dimensionality reduction to symbolic data, in particular symbolic data of span type. Symbolic objects are extensions of the classical data type. In conventional data sets the objects are "individualised", whereas symbolic objects are "unified" by means of relationships. Various definitions and descriptions of symbolic objects are given in (Diday, 1988, 1993; Gowda and Diday, 1991, 1992). Features characterizing a symbolic object may take more than one value, an interval of values, or may be qualitative. In real life we quite often come across features of interval/duration/spread/span type; some such case studies can be found in (Gowda and Diday, 1991, 1992; Nagabhushan, 1993). In view of the importance of this particular type of symbolic data, we have aimed to provide span-type symbolic features in a lower-dimensional space starting from span-type symbolic features in a higher-dimensional space. A mathematical model is devised to achieve the proposed target. A detailed computational procedure is presented showing the transformation of 3-D span symbolic features to 2-D span symbolic features, and an illustration is also provided. The origin of this mathematical model can be traced to the computational model proposed by Calahan (1972) for the worst-case tolerance analysis of electrical networks.

✩ This work is based on the research work on IFCPAR project no. 912-1, supported by the Centre Franco-Indien pour la Promotion de la Recherche Avancée (Indo-French Centre for the Promotion of Advanced Research).
* Corresponding author.

0167-8655/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved
SSDI 0167-8655(94)00085-9

2. Statement of the problem

There are $m$ samples in $n$-dimensional space. Each feature $f_i$ of sample $j$ is of symbolic span type, i.e. $f_{ij} = (\underline{f}_{ij}, \overline{f}_{ij})$, where $1 \le j \le m$, $1 \le i \le n$, and $f_{ij}$ takes a value in the interval $\underline{f}_{ij}$ to $\overline{f}_{ij}$. It is required to transform the $n$-D span features to $k$-D span features of the type $F_{tj} = (\underline{F}_{tj}, \overline{F}_{tj})$, where $1 \le j \le m$, $1 \le t \le k$ and $k \ll n$.

Clearly $F_{tj}$, or simply $F_t = T_t(f_1, f_2, \ldots, f_n)$. Here $T$ represents a feature transformation function. Spherical space projection, stereographic projection and triant projection are the $T$'s in (Nagabhushan, 1988); different types of geometric projections are the $T$'s in (Nagabhushan and Sudhanva, 1989; Sudhanva and Gowda, 1992); triangulation projection is one of the $T$'s in (Jain and Dubes, 1988); and $T$ can also be the function generating principal components. In brief, the objective is to obtain the transformed $k$-D feature vector

$A_j = [F_1 = (\underline{F}_1, \overline{F}_1)] \wedge [F_2 = (\underline{F}_2, \overline{F}_2)] \wedge \cdots \wedge [F_k = (\underline{F}_k, \overline{F}_k)]$

from the given $n$-D feature vector

$[f_1 = (\underline{f}_1, \overline{f}_1)] \wedge [f_2 = (\underline{f}_2, \overline{f}_2)] \wedge \cdots \wedge [f_n = (\underline{f}_n, \overline{f}_n)]$

for the $j$th symbolic object. Here $k \ll n$, $1 \le j \le m$, and $m$ is the number of symbolic objects. For instance, if $n = 3$ and $k = 2$, then the aim is to map the symbolic object cube (cuboid) in 3-D feature space (a hypercube in the case of $n$-D feature space) to a square (rectangle) in the 2-D transformed space, as shown in Figs. 1 and 2, respectively.

Fig. 1. Graphical representation of symbolic features.
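As a concrete rendering of the problem statement above, a span-type symbolic object can be held in code as a vector of intervals; this is a minimal sketch, and the class name `SymbolicObject` and its fields are our own illustrative choices, not notation from the paper.

```python
from dataclasses import dataclass

@dataclass
class SymbolicObject:
    """A symbolic object whose n features are all of span type:
    feature i takes a value in the closed interval spans[i] = (lo, hi)."""
    spans: list  # list of (lo, hi) tuples, one per feature

    @property
    def n_dims(self):
        return len(self.spans)

# A 3-D symbolic object is a cuboid in feature space (n = 3);
# dimensionality reduction maps it to a rectangle (k = 2).
cube = SymbolicObject(spans=[(22.50, 27.50), (22.50, 27.50), (90.00, 110.00)])
assert cube.n_dims == 3
assert all(lo <= hi for lo, hi in cube.spans)
```

The interval pairs mirror the $(\underline{f}_i, \overline{f}_i)$ notation of Section 2, one pair per dimension.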

3. Nominal feature values

The nominal feature value of $(\underline{f}_i, \overline{f}_i)$ is $f_i^0$, such that $\underline{f}_i \le f_i^0 \le \overline{f}_i$.
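The definition above leaves the nominal value free to lie anywhere within the span. A minimal sketch (the helper name `nominal` and the midpoint default are our choices; the paper does not prescribe a particular selection rule):

```python
def nominal(span, alpha=0.5):
    """Return a nominal feature value f0 inside the span (lo, hi).
    alpha = 0.5 gives the midpoint, one natural choice; any
    0 <= alpha <= 1 satisfies lo <= f0 <= hi."""
    lo, hi = span
    assert lo <= hi, "a span must satisfy lo <= hi"
    return lo + alpha * (hi - lo)

f0 = nominal((90.0, 110.0))
assert 90.0 <= f0 <= 110.0
```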

4. Model building

Fig. 2. An example.

The transformed feature vector is $[F_t] = [(\underline{F}_t, \overline{F}_t)]$, where

$F_t^0 = T_t(f_1^0, f_2^0, \ldots, f_n^0).$   (4.1)

The span in $F_t$ is due to the individual spans in the $f_i$ values. Hence the net span in $F_t$ can be expressed by partial derivatives:

$\Delta F_t = \sum_{i=1}^{n} \frac{\partial T_t}{\partial f_i}\, \Delta f_i.$   (4.2)

Each partial derivative represents the sensitivity index

$S_{f_i}^{F_t} = \frac{\partial T_t}{\partial f_i}$   (4.3)

of $F_t$ with respect to $f_i$. The sensitivity index is defined as the ratio of the incremental change in the output feature value ($F_t$) to an incremental change in the input feature value ($f_i$). This is in fact the first-order sensitivity index; higher-order sensitivity indices can also be defined. Eq. (4.2) can be rewritten as

$\Delta F_t = \sum_{i=1}^{n} S_{f_i}^{F_t}\, \Delta f_i.$   (4.4)

The aim is to evaluate the worst-case span in the transformed feature values. To compute the worst-case extremal in the positive direction, $\Delta F_t^+$, with reference to $F_t^0$, each product term of (4.4) has to be positive. However, $S_{f_i}^{F_t}$ may be positive or negative depending upon $T$; accordingly, $\Delta f_i$ has to be positive or negative, respectively. $\Delta f_i$ is made positive by defining $\Delta f_i = (\overline{f}_i - f_i^0)$ and made negative by $\Delta f_i = (\underline{f}_i - f_i^0)$. Similarly, for the worst-case extremal in the negative direction, $\Delta F_t^-$, each product term has to be negative. Hence

$\overline{F}_t = F_t^0 + \Delta F_t^+$   and   $\underline{F}_t = F_t^0 + \Delta F_t^-.$

Thus the transformed feature values are also of symbolic span type.

5. Detailed illustration

5.1. Detailed computational procedure

The sequence of steps is as follows.

1. Input is a finite set of symbolic objects. For simplicity we have considered symbolic objects in 3-D feature space, where the jth object is


$A_j = [f_1 = (\underline{p}, \overline{p})] \wedge [f_2 = (\underline{q}, \overline{q})] \wedge [f_3 = (\underline{r}, \overline{r})].$

Let the corresponding nominal values for $(f_1, f_2, f_3)$ be $(p^0, q^0, r^0)$. This is illustrated in Fig. 1.
2. $[X^0, Y^0]$ is obtained by applying a transformation $T$ to $[p^0, q^0, r^0]$. Here $[X^0, Y^0]$ is the transformed nominal 2-D feature vector.
3. Simulate perturbations in the $p$ column:
3.1. Perturbations in the positive direction: define the perturbed feature vector $[p_{\mathrm{new}} > p^0, q^0, r^0]$ (for computational purposes, $p_{\mathrm{new}} = p^0 + 10\%$ of $p^0$). Hence obtain $[X_{\mathrm{new}}, Y_{\mathrm{new}}]$. Compute

$S^X_{p+} = (X_{\mathrm{new}} - X^0)/(p_{\mathrm{new}} - p^0)$,
$S^Y_{p+} = (Y_{\mathrm{new}} - Y^0)/(p_{\mathrm{new}} - p^0)$.

3.2. Similarly simulate perturbations in the negative direction by defining $p_{\mathrm{new}} < p^0$, and thus compute $S^X_{p-}$ and $S^Y_{p-}$.
4. Repeat 3.1 and 3.2 on the $q$ and $r$ columns to compute $S^X_{q+}, S^X_{q-}, S^Y_{q+}, S^Y_{q-}$ and $S^X_{r+}, S^X_{r-}, S^Y_{r+}, S^Y_{r-}$.
5. The worst-case span in the positive direction is

$\Delta X^+ = S^X_{p+}(p - p^0) + S^X_{q+}(q - q^0) + S^X_{r+}(r - r^0).$

Here each product term is maintained positive as explained earlier.

Table 1
Input and transformed symbolic features

Sample  Input 3-D symbolic features                Transformed 2-D symbolic features
 1      [23.40,28.60][22.50,27.50][90.00,110.00]   [  6.08,  8.65][ -0.67,  2.17]
 2      [22.50,27.50][23.40,28.60][90.00,110.00]   [  6.08,  8.64][ -3.71,  2.86]
 3      [21.60,26.40][22.50,27.50][90.00,110.00]   [  4.94,  7.16][ -3.39,  2.45]
 4      [22.50,27.50][21.60,26.40][90.00,110.00]   [  4.94,  7.17][ -0.66,  2.16]
 5      [22.50,27.50][22.50,27.50][90.00,110.00]   [  5.51,  7.90][ -1.43,  1.95]
 6      [22.50,27.50][22.50,27.50][90.90,111.09]   [  5.08,  7.71][ -1.22,  1.77]
 7      [22.50,27.50][22.50,27.50][89.09,108.90]   [  5.94,  8.10][ -1.64,  2.12]
 8      [10.80,13.20][22.50,27.50][90.00,110.00]   [ -6.41,  3.12][-26.91,  9.51]
 9      [ 9.90,12.10][23.40,28.60][90.00,110.00]   [ -7.08,  3.87][-31.15, 10.98]
10      [ 9.00,11.00][22.50,27.50][90.00,110.00]   [ -8.40,  2.51][-30.84, 10.97]
11      [ 9.90,12.10][21.60,26.40][90.00,110.00]   [ -7.72,  1.76][-26.59,  9.51]
12      [ 9.90,12.10][22.50,27.50][90.00,110.00]   [ -7.40,  2.81][-28.87, 10.24]
13      [ 9.90,12.10][22.50,27.50][90.90,111.09]   [ -7.81,  2.56][-28.67, 10.24]
14      [ 9.90,12.10][22.50,27.50][89.09,108.90]   [ -6.99,  3.07][-29.08, 10.24]
15      [10.80,13.20][ 9.90,12.10][90.00,110.00]   [-12.55,-10.30][ -3.82,  5.60]
16      [ 9.90,12.10][10.80,13.20][90.00,110.00]   [-12.54,-10.28][ -2.40,  1.36]
17      [ 9.00,11.00][ 9.90,12.10][90.00,110.00]   [-14.06,-11.47][ -2.74,  1.76]
18      [ 9.90,12.10][ 9.00,11.00][90.00,110.00]   [-14.08,-11.49][ -4.16,  6.00]
19      [ 9.90,12.10][ 9.90,12.10][90.00,110.00]   [-13.31,-10.89][ -3.28,  3.68]
20      [ 9.90,12.10][ 9.90,12.10][90.90,111.09]   [-13.65,-11.27][ -3.49,  3.85]
21      [ 9.90,12.10][ 9.90,12.10][89.09,108.90]   [-13.07,-10.41][ -3.07,  3.51]
22      [23.40,28.60][ 9.90,12.10][90.00,110.00]   [ -7.04,  3.76][-11.35, 32.51]
23      [22.50,27.70][10.80,13.20][90.00,110.00]   [ -6.41,  3.22][-10.03, 29.14]
24      [21.60,26.40][ 9.90,12.10][90.00,110.00]   [ -7.68,  1.67][-10.28, 28.67]
25      [22.50,27.50][ 9.00,11.00][90.00,110.00]   [ -8.35,  2.40][-11.69, 32.91]
26      [22.50,27.50][ 9.90,12.10][90.00,110.00]   [ -7.36,  2.71][-10.81, 30.59]
27      [22.50,27.50][ 9.90,12.10][90.90,111.09]   [ -7.76,  2.46][-11.02, 30.76]
28      [22.50,27.50][ 9.90,12.10][89.09,108.90]   [ -6.99,  2.97][-10.60, 30.42]
29      [23.40,28.60][22.50,27.50][81.00, 99.00]   [  9.45, 11.67][ -1.55,  3.19]
30      [22.50,27.50][23.40,28.60][81.00, 99.00]   [  9.47, 11.69][ -5.80,  4.61]
31      [21.60,26.40][22.50,27.50][81.00, 99.00]   [  8.29, 10.38][ -5.47,  4.20]
32      [22.50,27.50][21.60,26.40][81.00, 99.00]   [  8.28, 10.36][ -1.22,  2.78]
33      [22.50,27.50][22.50,27.50][81.00, 99.00]   [  8.87, 10.88][ -3.51,  3.70]
34      [22.50,27.50][22.50,27.50][81.90,100.09]   [  8.59, 10.53][ -3.30,  3.52]
35      [22.50,27.50][22.50,27.50][80.09, 97.90]   [  9.16, 11.24][ -3.72,  3.87]


Similarly, the worst-case span in the negative direction is

$\Delta X^- = S^X_{p-}(p - p^0) + S^X_{q-}(q - q^0) + S^X_{r-}(r - r^0),$

with each term being maintained negative. In a similar way $\Delta Y^+$ and $\Delta Y^-$ are computed. The span-type transformed 2-D feature values, therefore, are

$\overline{X} = X^0 + \Delta X^+$,   $\underline{X} = X^0 + \Delta X^-$;
$\overline{Y} = Y^0 + \Delta Y^+$,   $\underline{Y} = Y^0 + \Delta Y^-$.

That is, $A_j = [F_1 = (\underline{X}, \overline{X})] \wedge [F_2 = (\underline{Y}, \overline{Y})]$. Fig. 2 illustrates this.
6. The steps are repeated for all samples.
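The steps above can be sketched in code. This is a reconstruction under stated assumptions, not the authors' program: midpoints serve as the nominal values, $T$ is a principal-component map fitted on the nominal vectors of all samples (one of the choices of $T$ named in Section 2), sensitivities are estimated by the ±10% perturbations of step 3, and all function and variable names are ours.

```python
import numpy as np

def fit_pca_transform(nominals, k=2):
    """Step 2's T: project onto the first k principal components of the
    nominal vectors of all samples (an assumed, paper-consistent choice)."""
    mean = nominals.mean(axis=0)
    cov = np.cov(nominals - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs[:, np.argsort(vals)[::-1][:k]]            # n_features x k
    return lambda x: (np.asarray(x, dtype=float) - mean) @ W

def worst_case_spans(obj_spans, T, rel_step=0.10):
    """Steps 1-6: propagate span-type features through T using
    finite-difference sensitivity indices, keeping every product term
    positive for the + extremal and negative for the - extremal."""
    spans = np.asarray(obj_spans, dtype=float)         # (n, 2) rows of [lo, hi]
    f0 = spans.mean(axis=1)                            # nominal values (midpoints)
    F0 = T(f0)                                         # transformed nominal, shape (k,)
    dF_pos = np.zeros_like(F0)
    dF_neg = np.zeros_like(F0)
    for i in range(len(f0)):
        step = rel_step * f0[i]                        # 10% perturbation (step 3)
        e_i = np.eye(len(f0))[i]
        S_plus = (T(f0 + step * e_i) - F0) / step      # forward sensitivities
        S_minus = (T(f0 - step * e_i) - F0) / (-step)  # backward sensitivities
        lo_d, hi_d = spans[i, 0] - f0[i], spans[i, 1] - f0[i]
        dF_pos += np.where(S_plus > 0, S_plus * hi_d, S_plus * lo_d)     # terms >= 0
        dF_neg += np.where(S_minus > 0, S_minus * lo_d, S_minus * hi_d)  # terms <= 0
    return np.stack([F0 + dF_neg, F0 + dF_pos], axis=1)  # (k, 2): [F_lo, F_hi]

# Usage on a few Table-1-style 3-D symbolic objects
table = [
    [(23.40, 28.60), (22.50, 27.50), (90.00, 110.00)],
    [(10.80, 13.20), (22.50, 27.50), (90.00, 110.00)],
    [(10.80, 13.20), ( 9.90, 12.10), (90.00, 110.00)],
    [(23.40, 28.60), ( 9.90, 12.10), (90.00, 110.00)],
    [(23.40, 28.60), (22.50, 27.50), (81.00,  99.00)],
]
nominals = np.array([[(lo + hi) / 2 for lo, hi in row] for row in table])
T = fit_pca_transform(nominals, k=2)
out = worst_case_spans(table[0], T)
assert out.shape == (2, 2) and np.all(out[:, 0] <= out[:, 1])
```

Because a principal-component map is linear, the forward and backward sensitivities coincide here; the perturbation machinery matters when $T$ is nonlinear (e.g. the geometric projections cited in Section 2).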


5.2. An example

For the purpose of illustration, a slightly modified version of the 3-D data of 35 samples used in the computer simulation experiments in (Nagabhushan, 1988; Nagabhushan and Sudhanva, 1989; Sudhanva and Gowda, 1992; Gowda and Diday, 1992) is considered. Table 1 shows the input 3-D symbolic data set (span type). The data set has 5 classes, each of 7 samples. This 3-D data is transformed to 2-D symbolic data by generating the first two principal components (the transformation function T chosen is principal component analysis). Table 1 provides the span type of features in the dimensionality-reduced feature space. Fig. 2 shows graphically the transformed 2-D symbolic objects. There are 5 objects (samples 3, 8, 15, 22, 29) in this graph, each one being a sample chosen from one of the five classes present in the data set.

6. Conclusion

In this paper a mathematical model is devised enabling the dimensionality reduction of n-D symbolic data, particularly of span type, to a lower-dimensional k-D symbolic span version (k << n). Dimensionality reduction of symbolic data should find applications in symbolic data analysis, like feature reduction in conventional data analysis. The authors expect that this will open a new direction in symbolic data analysis, as it is a new line of thinking in the dimensionality reduction of symbolic objects.

Acknowledgement

Assistance rendered by Mr. T.V. Ravi, Mr. N.S. Nagaraj, S.J.C.E., Mysore, and Ms. C. Ahlam, INRIA, Rocquencourt, is appreciated.

References

Calahan, D.A. (1972). Computer Aided Network Design. McGraw-Hill, New York.
Diday, E. (1988). The symbolic approach in clustering, classification and related methods of data analysis. In: H.H. Bock, Ed., Classification and Related Methods of Data Analysis. North-Holland, Amsterdam.
Diday, E. (1993). From data to knowledge: boolean, probabilistic, possibilist and belief objects for symbolic data analysis. Tutorial at IV Conf. IFCS.
Gowda, K.C. and E. Diday (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24 (6).
Gowda, K.C. and E. Diday (1992). Symbolic clustering using a new similarity measure. IEEE Trans. Syst. Man Cybernet. 22 (2).
Jain, A.K. and R.C. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
Nagabhushan, P. (1988). An efficient method for classifying remotely sensed data, incorporating dimensionality reduction. Ph.D. Thesis, Mysore University, India.
Nagabhushan, P. (1993). Recognition of print numerals using symbolic data analysis. IV Conf. IFCS.
Nagabhushan, P. and D. Sudhanva (1989). A new technique for image compression and display using geometric projections. EI-89, Vol. 1.
Sudhanva, D. and K.C. Gowda (1992). Dimensionality reduction using geometric projection: a new technique. Pattern Recognition 25 (8).