
WeS3T3.3

An Information Theoretic Approach to Constructing Machine Learning Criteria

K.R. Chernyshov
V.A. Trapeznikov Institute of Control Sciences, 65 Profsoyuznaya, Moscow 117997, Russia (e-mail: [email protected])

11th IFAC ALCOSP, July 3-5, 2013. Caen, France. 978-3-902823-37-3/2013 © IFAC. doi: 10.3182/20130703-3-FR-4038.00145

Abstract. Selecting a learning criterion is a constituent part of a machine learning problem statement, requiring account both of its adequacy to the available data and of the practical suitability of its implementation. The paper presents an approach to machine learning based on information-theoretic criteria derived from the Rényi entropy of arbitrary order. A parameterized description of the learning machine is utilized, combined with a corresponding technique for estimating mutual information constructed from Rényi entropies. This finally leads to a problem of finite-dimensional optimization, to be solved by a suitable technique. The proposed treatment is preceded by a thorough review of existing information-theoretic and entropy-based approaches to machine learning. The paper has been supported by a grant of the Russian Foundation for Basic Research (RFBR), project 12-08-01205-a.

Keywords: Machine learning, Information-theoretic criteria, Measures of dependence, Rényi entropy

1. INTRODUCTION

In the majority of cases, solving a machine learning problem implies applying a measure of dependence between random values (processes). Among measures of dependence, the conventional correlation and covariance ones are the most widely used. Their application follows directly from problem statements based on the mean-squared criterion. A main advantage of these measures is their convenience of use, including both the possibility of deriving explicit analytical expressions for the required characteristics and the relative simplicity of constructing their estimates, including estimates based on observations of dependent data. However, the main disadvantage of measures of dependence based on linear correlation is that they may vanish even when a deterministic dependence exists between the pair of investigated variables.

To overcome this disadvantage, more complicated, nonlinear measures of dependence have been employed. A feature of the technique proposed in the paper is that it is based on applying a consistent measure of dependence. Following Kolmogorov's terminology, a measure of stochastic dependence between two random variables is referred to as consistent if it vanishes if and only if the random variables are stochastically independent (Sarmanov and Zakharov, 1960). Consistent (in the Kolmogorov sense) measures of dependence are ultimately based on a comparison of the joint and the corresponding marginal distributions of the investigated random values. Such a comparison may be set up either in an explicit or in an implicit form. The former relates to well-known divergence measures (e.g., Basseville, 2013); the latter may be represented by the maximal correlation (Rényi, 1959). But the main drawback of the maximal correlation is its computation, since it must be implemented as an iterative procedure of determining the first pair of eigenfunctions, and the first eigenvalue to which these functions correspond, of the stochastic kernel formed by the joint and marginal probability distribution densities (Sarmanov, 1963).

Thus, a constructive way of applying an information-theoretic criterion that does not rest on restrictive preliminary limitations cannot be based on directly utilizing an analytical expression for the corresponding probability distribution densities. Hence, a principal feature of such a constructive method is applying suitable estimates of the information-theoretic criterion built from sample data, rather than its analytical expression.

Learning theory develops models from data in an inductive framework. It is therefore no surprise that one of the critical issues of learning is generalization. But before generalizing, the machine must learn from the data. How an agent learns from the real world is far from being totally understood (Principe, 2010). The most developed framework to study learning is perhaps statistical learning theory (Vapnik, 1998), where the goal of the learning machine is to approximate the (unknown) a posteriori probability of the targets given a set of exemplars. But there are many learning scenarios that do not fit this model (such as learning without a teacher). Instead, one can think of the agent as being exposed to sources of information from the external world, exploring and exploiting redundancies from one or more sources. This alternative view of learning shifts the problem to the quantification of redundancy and ways to manipulate it. Since redundancy is intrinsically related to the mathematical concept of information, information theory becomes the natural framework to study machine learning (Principe, 2010).

In information-theoretic learning, the starting point is a data set that globally conveys information about a real-world event. The goal is to capture the information in the parameters of a learning machine, using some information-theoretic performance criterion. A typical setup for information-theoretic learning is as follows. The system output is given by y(t) = ψ(θ, x(t)), where x(t) is the data pattern presented to the system at iteration (time instant) t. The function ψ(·, ·) represents a possibly non-linear data transformation that depends on a parameter vector θ. Information-theoretic learning seems the natural way to train the parameters of a learning machine, because the ultimate goal of learning is to transfer the information contained in the external data (input and/or desired response) onto the parametric adaptive system. The system may receive external input in the form of a desired response, in which case it operates in a supervised learning mode (Principe, 2010).

The mean squared error (MSE) criterion has traditionally been the workhorse of adaptive systems (Principe, 2010). However, the great advantage of information-theoretic criteria is that they are able to capture higher-order statistical information in the data, as opposed to the MSE, which is a second-order statistical criterion. This property is important and has been successfully applied in numerous machine learning problems presented by Bonev et al. (2013), Garbarine et al. (2011), Kamimura (2010), Porto-Díaz et al. (2011), Singh and Principe (2011), and many others.

2. A MEASURE OF DEPENDENCE BASED ON THE RÉNYI ENTROPY

Along with the Shannon definition of the entropy which, in turn, leads to the definition of the Shannon mutual information, alternative ways to define the entropy are known. For a random value X having a probability distribution density f(x), the Rényi entropy of order α is defined as (Rényi, 1960, 1976a,b)

    R_\alpha(X) = \frac{1}{1-\alpha} \ln \mathrm{E}\big( (f(x))^{\alpha-1} \big), \quad \alpha > 0, \; \alpha \neq 1.    (1)

Meanwhile, as α tends to 1, the right-hand side of (1) tends to the expression defining the Shannon entropy, which may thus be considered as a limit case of the Rényi entropy of "order 1". From a computational point of view, especially when estimation from sample data is necessary, the Rényi entropy is recognized as more attractive than the Shannon one, since the Rényi entropy involves a "logarithm of an integral", which is computationally simpler than the "integral of a logarithm" arising in the case of the Shannon entropy. Meanwhile, the selection of a particular value of the order α is of importance, since the larger the order, the more complicated the computational procedure becomes. It may also be noted that, for continuously distributed random values, the Rényi entropy takes its values in the interval (−∞, ∞), as does the Shannon one; and for some probability distribution densities the Shannon and Rényi entropies may coincide, in which case the latter, of course, does not depend on the order α. This is valid, for instance, for the uniform distribution on an interval [a, b], when both the Shannon and the Rényi (of arbitrary order) entropies have the form ln(b − a). Analytical expressions for a large number of univariate and multivariate distributions are presented in the papers of Nadarajah and Zografos (2003, 2005) and Zografos and Nadarajah (2005).

The Rényi entropy also possesses an intrinsic property that helps to interpret its estimation. In the space of non-negative functions f(x), x ∈ R, integrable with degree α, let the norm be defined by the expression

    \| f \|_\alpha = \left( \int_{-\infty}^{\infty} (f(x))^{\alpha} \, dx \right)^{1/\alpha}.    (2)

Expression (2) is referred to as the α-norm. Then, if f(x) is a probability distribution density of a random value X, the Rényi entropy is expressed via its α-norm as

    R_\alpha(X) = \frac{\alpha}{1-\alpha} \ln \| f(x) \|_\alpha.    (3)

Representation (3) of the Rényi entropy demonstrates the computational advantage of obtaining its estimate versus that of the Shannon entropy. The principal difference is that the Shannon entropy definition rests directly on the probability distribution density, while the Rényi entropy relates to the norm of the probability density function; and, as a quantity, the norm of a vector is generically easier to estimate than the probability distribution density itself.

Among the various values of the order α > 0, α ≠ 1, all may be applied to form the vector norm, but the problem complexity will grow exponentially in α, while the application of α = 2 has been reported in the literature to give good results (Principe, 2010). For α = 2, the quantity

    R_2(X) = -\ln \mathrm{E}\big( f(x) \big) = -2 \ln \| f(x) \|_2

is referred to as the quadratic entropy.

The considerations of the present section have so far dealt with the marginal Rényi entropy. Along with the marginal entropy, one may correspondingly define the mutual Rényi entropy of order (α₁, α₂) for a pair of probabilistic measures of a pair of random variables X and Y with joint and marginal probability distribution densities. Within the approach, the first probabilistic measure is defined by the joint probability distribution density h(x, y) of the random variables X and Y; the second measure is defined by the product of the marginal probability distribution densities f(x) and g(y) of the pair. The mutual Rényi entropy of order (α₁, α₂) is thus defined in the following manner:

    R_{\alpha_1, \alpha_2}(X, Y) = \frac{1}{1-\alpha} \ln \mathrm{E}_h \big( (h(x,y))^{\alpha_1 - 1} (f(x) g(y))^{\alpha_2} \big), \quad \alpha_1^2 + \alpha_2^2 > 0, \; \alpha_1 + \alpha_2 = \alpha \neq 1,

where the mathematical expectation E_h is taken over h(x, y).
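As a numerical illustration of definitions (1)-(3), the Rényi entropy can be computed through the α-norm (2) and checked against two properties noted above: for the uniform density on [a, b] it equals ln(b − a) for every order α, and for α near 1 it approaches the Shannon entropy. This is only a sketch; the grids, test densities, and tolerances are assumptions of the example, not part of the paper.

```python
import numpy as np

def alpha_norm(fx, dx, alpha):
    # alpha-norm (2): ||f||_a = ( integral of f(x)^a dx )^(1/a), trapezoidal rule
    w = np.full_like(fx, dx)
    w[0] = w[-1] = dx / 2.0
    return np.sum(w * fx**alpha) ** (1.0 / alpha)

def renyi_entropy(fx, dx, alpha):
    # representation (3): R_a(X) = a/(1 - a) * ln ||f||_a
    return alpha / (1.0 - alpha) * np.log(alpha_norm(fx, dx, alpha))

# Uniform density on [a, b]: R_a = ln(b - a) for every order a
a, b = 0.0, 2.0
x = np.linspace(a, b, 200001)
dx = x[1] - x[0]
fu = np.full_like(x, 1.0 / (b - a))

# Standard normal density: R_a approaches the Shannon entropy as a -> 1
xg = np.linspace(-10.0, 10.0, 200001)
dxg = xg[1] - xg[0]
fg = np.exp(-xg**2 / 2.0) / np.sqrt(2.0 * np.pi)
shannon = 0.5 * np.log(2.0 * np.pi * np.e)  # Shannon (differential) entropy of N(0, 1)

print(renyi_entropy(fu, dx, 2.0))           # ln(b - a) = ln 2
print(renyi_entropy(fg, dxg, 1.000001))     # close to `shannon`
```

The quadratic entropy mentioned above is recovered at α = 2; for N(0, 1) it evaluates to ln(2√π).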

The marginal Rényi entropy is thus a partial case of the mutual one, arising when either α₁ = 0 or α₂ = 0. For the first case, α₁ = 0, α₂ = α, the mutual Rényi entropy is

    R_{0,\alpha}(X, Y) = \frac{1}{1-\alpha} \ln \mathrm{E}_h \left( \frac{(f(x) g(y))^{\alpha}}{h(x,y)} \right);

for the second case (α₁ = α, α₂ = 0),

    R_{\alpha,0}(X, Y) = \frac{1}{1-\alpha} \ln \mathrm{E}_h \big( (h(x,y))^{\alpha - 1} \big).

In all these cases the mathematical expectation is (formally) taken over h(x, y). At the same time, R_{0,α}(X, Y) does not actually depend on h(x, y), while R_{α,0}(X, Y) does not depend on f(x) and g(y) at all. So these should rather be considered as marginal entropies of the probability distribution densities f(x)g(y) and h(x, y), respectively. These marginal entropies will be denoted R_α(fg) and R_α(h), i.e.

    R_\alpha(fg) = \frac{1}{1-\alpha} \ln \mathrm{E}_{fg} \big( (f(x) g(y))^{\alpha - 1} \big) = R_{0,\alpha}(X, Y),

where the expectation is taken over f(x)g(y); and

    R_\alpha(h) = \frac{1}{1-\alpha} \ln \mathrm{E}_h \big( (h(x,y))^{\alpha - 1} \big) = R_{\alpha,0}(X, Y),

where the expectation is taken over h(x, y).

Of course, it should also be noted that R_α(h) = R_α(f) + R_α(g) when the random values X and Y are stochastically independent.

Also, among the non-zero cases α₁ ≠ 0, α₂ ≠ 0, the "symmetric" one, α₁ = α₂ = α/2, is emphasized; it will serve as a basis to obtain the mutual Rényi information of order α of random variables.

Again, since the classical (Shannon) mutual information I(X, Y) of a pair of random variables X and Y has a representation via the corresponding entropies of the variables of the form

    I(X, Y) = -H(X, Y) + H(X) + H(Y),

it would be natural to search for the mutual Rényi information I_{Rα}(X, Y) of order α of the pair of random variables X and Y in a similar form, e.g.

    I_{R\alpha}(X, Y) = c_1 R_{\alpha/2, \alpha/2}(X, Y) + c_2 R_\alpha(h) + c_3 R_\alpha(fg),

where c₁, c₂, c₃ are normalization coefficients to be chosen to meet the condition: I_{Rα}(X, Y) ≥ 0, with I_{Rα}(X, Y) = 0 if the random variables X and Y are stochastically independent.

From this condition, an infinite number of solutions follows, which may be uniformly described in the form

    c_1 = \nu, \; c_2 = c_3 = -\frac{\nu}{2}, \quad \text{if } \alpha > 1; \qquad c_1 = -\nu, \; c_2 = c_3 = \frac{\nu}{2}, \quad \text{if } \alpha < 1,

where ν > 0. Thus, one may simply set ν = 1. Summarizing the consideration, the mutual Rényi information of order α has the form

    I_{R\alpha}(X, Y) = \frac{1}{\gamma(\alpha)} \ln \frac{ \mathrm{E}_h \big( (h(x,y))^{\alpha/2 - 1} (f(x) g(y))^{\alpha/2} \big) }{ \sqrt{ \mathrm{E}_h \big( (h(x,y))^{\alpha - 1} \big) \, \mathrm{E}_{fg} \big( (f(x) g(y))^{\alpha - 1} \big) } },

where γ(α) = 1 − α if α > 1 and γ(α) = α − 1 if α < 1; γ(α) is negative in both cases, so that, by the Cauchy-Schwarz inequality, the non-positive logarithm yields I_{Rα}(X, Y) ≥ 0.

As well as the Shannon mutual information, I_{Rα}(X, Y) takes its values in the interval [0, ∞). Meanwhile, the expression under the sign of the logarithm has an evident interpretation as the cosine of the angle between elements of the corresponding Hilbert space of functions mapping R² into R, in which the inner product of vectors φ₁(x, y), φ₂(x, y) is defined by the expression

    \langle \varphi_1(x,y), \varphi_2(x,y) \rangle = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \varphi_1(x,y) \, \varphi_2(x,y) \, dx \, dy,

and the Euclidean norm of a vector φ(x, y) by

    \| \varphi(x,y) \|_2 = \sqrt{ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (\varphi(x,y))^2 \, dx \, dy }.

Thus, following the notations introduced, the mutual Rényi information is written as

    I_{R\alpha}(X, Y) = \frac{1}{\gamma(\alpha)} \ln \frac{ \langle (h(x,y))^{\alpha/2}, (f(x) g(y))^{\alpha/2} \rangle }{ \| (h(x,y))^{\alpha/2} \|_2 \, \| (f(x) g(y))^{\alpha/2} \|_2 } = \frac{1}{\gamma(\alpha)} \ln \cos \big( (h(x,y))^{\alpha/2}, (f(x) g(y))^{\alpha/2} \big).
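The cosine form of I_{Rα}(X, Y) lends itself to direct numerical evaluation on a grid. The following sketch (the bivariate normal test densities, the grid, and the order values are assumptions of this illustration, not taken from the paper) discretizes h(x, y) and f(x)g(y), computes the inner product and norms, and checks that I_{Rα} vanishes for independent variables and is strictly positive under dependence.

```python
import numpy as np

# Grid over a truncated plane; N(0,1) tails beyond |z| = 6 are negligible
n = 301
x = np.linspace(-6.0, 6.0, n)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")

def bvn(rho):
    # bivariate normal density with unit variances and correlation rho
    q = (X**2 - 2.0*rho*X*Y + Y**2) / (1.0 - rho**2)
    return np.exp(-q / 2.0) / (2.0*np.pi*np.sqrt(1.0 - rho**2))

def mutual_renyi(h, fg, alpha):
    # I_Ra = (1/gamma) * ln( <h^{a/2}, (fg)^{a/2}> / (||h^{a/2}||_2 ||(fg)^{a/2}||_2) )
    gamma = (1.0 - alpha) if alpha > 1.0 else (alpha - 1.0)  # negative in both branches
    inner = np.sum(h**(alpha/2.0) * fg**(alpha/2.0)) * dx * dx
    norm_h = np.sqrt(np.sum(h**alpha) * dx * dx)
    norm_fg = np.sqrt(np.sum(fg**alpha) * dx * dx)
    return np.log(inner / (norm_h * norm_fg)) / gamma

fg = bvn(0.0)                               # product of the N(0,1) marginals
I_indep = mutual_renyi(bvn(0.0), fg, 2.0)   # h = fg: exactly zero
I_dep = mutual_renyi(bvn(0.8), fg, 2.0)     # dependent pair: strictly positive
print(I_indep, I_dep)
```

For the bivariate normal with correlation ρ and α = 2, a direct Gaussian-integral calculation gives I_{R2} = −ln( 2 (1 − ρ²)^{1/4} / √(4 − ρ²) ), which the discretized value reproduces closely.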

For the partial case α = 2, the above considerations immediately yield the so-called Cauchy-Schwarz divergence

    D_{CS}(X, Y) = -\ln \frac{ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x,y) \, f(x) g(y) \, dx \, dy }{ \sqrt{ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h^2(x,y) \, dx \, dy \; \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (f(x) g(y))^2 \, dx \, dy } },

proposed in the fullness of time as a declarative statement in a number of papers, summarized in (Principe, 2010), on the basis of the Cauchy-Schwarz inequality and disregarding the above condition imposed on the relationship between the mutual Rényi information and the corresponding Rényi entropies. Thus, D_CS(X, Y) = I_{Rα}(X, Y) for α = 2.

To demonstrate the behaviour of I_{Rα}(X, Y) as a measure of dependence, one may consider the following example based on applying a bivariate distribution density from the O.V. Sarmanov class of distributions (Kotz et al., 2000). In accordance with the properties of the O.V. Sarmanov distributions, one may derive the following probability distribution density:

    p_{S,\lambda}(y, x) = \frac{1}{2\pi} \, e^{-\frac{x^2 + y^2}{2}} \left( 1 + \lambda \left( 2 e^{-\frac{3 x^2}{2}} - 1 \right) \left( 2 e^{-\frac{3 y^2}{2}} - 1 \right) \right), \quad |\lambda| \le 1.

For the density p_{S,λ}(x, y), both the ordinary correlation and the correlation ratio are equal to zero, although the random values are stochastically dependent; in particular, the maximal correlation coefficient for the density p_{S,λ}(x, y) is S(Y, X) = (4/√7 − 1)|λ|. The parameter λ considerably influences the shape of the density p_{S,λ}(x, y). Figure 1 graphically presents the dependence of the values of I_{Rα}(X, Y), corresponding to the density p_{S,λ}(x, y), as a function of the parameter λ, |λ| ≤ 1, for various magnitudes of the order α.

Fig. 1: The dependence of the values of I_{Rα}(X, Y), corresponding to the density p_{S,λ}(x, y), as a function of the parameter λ, |λ| ≤ 1, for various magnitudes of the order α (curves shown for α = 0.4, 1.01, 1.1, 2, 4).

3. MACHINE LEARNING PROBLEM WITH AN INFORMATION-THEORETIC CRITERION

A constructive way of applying the information-theoretic criterion that would not rest on restrictive preliminary limitations may not be based on directly utilizing an analytical expression of the considered form, since the latter is a functional of the unknown marginal and joint distribution densities of the output of the system (S) and the desired response (DR). Hence, a principal feature of such a constructive method is applying suitable estimates of the information-theoretic criterion built from sample data, rather than its analytical expression.

Let us consider a class of non-linear systems, widely used not only in the theory and practice of machine learning, described by a linear-in-parameters mapping

    y_S(t; \theta) = \theta^T \varphi(t), \quad \theta = (\theta_1, \ldots, \theta_n)^T.    (4)

The components of the column-vector φ(t) = (φ₁(t), …, φₙ(t))^T are some known functions of preceding values of the input of the system, as well as, generically, of preceding values of the system output.

Within the problem statement, the system parameters, that is, the components of the column-vector θ, are to be determined in accordance with the information-theoretic criterion

    I_{R\alpha}\{ y_{DR}(t), y_S(t; \theta) \} \to \sup_{\theta},    (5)

where y_DR(t) is the desired response. Simultaneously, in (5), the analytical expression for the mutual Rényi information I_{Rα} of order α is substituted with a suitable estimate

    \hat{I}_{R\alpha}^{\, y_{DR}^{(1)}, \ldots, y_{DR}^{(N)}; \, \varphi^{(1)}, \ldots, \varphi^{(N)}} \{\theta\} = f(\theta)    (6)

obtained on the basis of N observations of sample data φ^(1), …, φ^(N), y_DR^(1), …, y_DR^(N) of the (generalized) input φ(t) and the desired response y_DR(t) of the system.
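As a sketch of the problem setup (4)-(6), the linear-in-parameters system and its sample data can be set up as follows; criterion (5) would then select θ by maximizing the estimated mutual information between y_DR(t) and y_S(t; θ). The particular regressors, parameter values, and noise-free simulation below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200
u = rng.normal(size=N)          # hypothetical scalar input sequence

def phi(u, y, t):
    # phi(t): an illustrative regressor of two past inputs and one past output
    return np.array([u[t-1], u[t-2], y[t-1]])

# Generate the desired response from an assumed "true" parameter vector
theta_true = np.array([0.8, -0.3, 0.5])
y_DR = np.zeros(N)
for t in range(2, N):
    y_DR[t] = theta_true @ phi(u, y_DR, t)

# System output (4) for a candidate parameter vector theta
theta = np.array([0.7, -0.2, 0.4])
y_S = np.array([theta @ phi(u, y_DR, t) for t in range(2, N)])
print(y_S[:3])
```

With such paired samples y_DR^(i), φ^(i), the estimate (6) becomes an ordinary function of θ, to be maximized numerically.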

In turn, in (6),

    \varphi^{(i)} = \left( \varphi_1^{(i)}, \ldots, \varphi_n^{(i)} \right)^T, \quad i = 1, \ldots, N.

Here and below, the upper index (N) is used for the corresponding estimate of a function over the sample of length N.

Then, within the approach, the initial machine learning problem (4) with the information-theoretic criterion (5), (6), under availability of sample values of the input and output processes, gives rise to a problem of finite-dimensional optimization,

    f(\theta) \to \sup_{\theta},

to solve which an explicit analytical expression for the function f(θ) in (6) is required. In turn, such a derivation is based on applying a corresponding technique of estimating the Rényi mutual information. The function f(θ) may be obtained in various ways concerned with estimating the joint and marginal distribution densities of the input and output processes from sample data. For such cases, the Rosenblatt kernel-type density estimates (Rosenblatt, 1956a) are widely used. Estimating the Rényi mutual information is implemented via estimating the corresponding Rényi mutual and marginal entropies. Due to the conditions imposed on the Rényi mutual information in the preceding section, the Rényi mutual information of the desired response y_DR(t) and the system output y_S(t; θ) is expressed via their marginal and mutual entropies as follows:

    I_{R\alpha}(y_{DR}(t), y_S(t; \theta)) = c_1 R_{\alpha/2, \alpha/2}(y_{DR}(t), y_S(t; \theta)) + c_2 R_\alpha(p_{DR,S}(y_{DR}(t), y_S(t; \theta))) + c_3 R_\alpha(p_{DR}(y_{DR}(t)) \, p_S(y_S(t; \theta))),

where, in accordance with the preceding section,

    c_1 = 1, \; c_2 = c_3 = -\frac{1}{2}, \quad \text{if } \alpha > 1; \qquad c_1 = -1, \; c_2 = c_3 = \frac{1}{2}, \quad \text{if } \alpha < 1,

and p_DR(y_DR(t)), p_S(y_S(t; θ)), p_{DR,S}(y_DR(t), y_S(t; θ)) are the marginal and joint distribution densities of the desired response y_DR(t) and the system output y_S(t; θ).

Again, since

    R_\alpha(p_{DR}(y_{DR}(t)) \, p_S(y_S(t; \theta))) = R_\alpha(p_{DR}(y_{DR}(t))) + R_\alpha(p_S(y_S(t; \theta)))

and R_α(p_DR(y_DR(t))) does not depend on θ, it is equivalent to consider the function

    \tilde{I}_{R\alpha}(y_{DR}(t), y_S(t; \theta)) = I_{R\alpha}(y_{DR}(t), y_S(t; \theta)) - c_3 R_\alpha(p_{DR}(y_{DR}(t)))

instead of I_{Rα}(y_DR(t), y_S(t; θ)).

Then, following the approach of Mokkadem (1989), proposed for Shannon mutual information estimation, obtaining the estimates

    \hat{\tilde{I}}_{R\alpha}^{\, y_{DR}^{(1)}, \ldots, y_{DR}^{(N)}; \, \varphi^{(1)}, \ldots, \varphi^{(N)}} \{\theta\} = \hat{\tilde{f}}(\theta)

of the mutual information Ĩ_{Rα}(y_DR(t), y_S(t; θ)), using N pairs of sample data of the variables y_DR(t) and φ(t), is naturally based on the following relationships involving kernel estimates of the distribution densities:

    \hat{\tilde{I}}_{R\alpha}^{\, y_{DR}^{(1)}, \ldots, y_{DR}^{(N)}; \, \varphi^{(1)}, \ldots, \varphi^{(N)}} \{\theta\} = c_1 R_{\alpha/2, \alpha/2}^{(N)}(y_{DR}(t), y_S(t; \theta)) + c_2 R_\alpha^{(N)}(p_{DR,S}(y_{DR}(t), y_S(t; \theta))) + c_3 R_\alpha^{(N)}(p_S(y_S(t; \theta))),    (7)

    R_\alpha^{(N)}(p_{DR,S}(y_{DR}(t), y_S(t; \theta))) = \frac{1}{1-\alpha} \ln \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( p_{DR,S}^{(N)}(y_{DR}(t), y_S(t; \theta)) \right)^{\alpha} dy_{DR}(t) \, dy_S(t; \theta),    (8)

    R_\alpha^{(N)}(p_{DR}(y_{DR}(t)) \, p_S(y_S(t; \theta))) = \frac{1}{1-\alpha} \ln \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( p_{DR}^{(N)}(y_{DR}(t)) \, p_S^{(N)}(y_S(t; \theta)) \right)^{\alpha} dy_{DR}(t) \, dy_S(t; \theta),    (9)

    R_{\alpha/2, \alpha/2}^{(N)}(y_{DR}(t), y_S(t; \theta)) = \frac{1}{1-\alpha} \ln \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left( p_{DR,S}^{(N)}(y_{DR}(t), y_S(t; \theta)) \right)^{\alpha/2} \left( p_{DR}^{(N)}(y_{DR}(t)) \, p_S^{(N)}(y_S(t; \theta)) \right)^{\alpha/2} dy_{DR}(t) \, dy_S(t; \theta),    (10)

    p_{DR}^{(N)}(y_{DR}(t)) = \frac{1}{N h_N} \sum_{i=1}^{N} K_1 \left( \frac{y_{DR}(t) - y_{DR}^{(i)}}{h_N} \right),    (11)

    p_S^{(N)}(y_S(t; \theta)) = \frac{1}{N h_N} \sum_{i=1}^{N} K_2 \left( \frac{y_S(t; \theta) - \theta^T \varphi^{(i)}}{h_N} \right),    (12)
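Kernel estimates of this form are straightforward to implement. The sketch below builds a marginal Rosenblatt estimate of the form (11) and plugs it into a Rényi entropy estimate of the same plug-in form as (8)-(10). The Gaussian kernel, the bandwidth rule h_N = N^{−1/5}, the sample size, and the grid are assumptions of this illustration; the paper itself only requires positive bounded kernels and a vanishing sequence {h_N}.

```python
import numpy as np

rng = np.random.default_rng(1)

def K(z):
    # Gaussian kernel (an assumption; any positive bounded kernel qualifies)
    return np.exp(-z**2 / 2.0) / np.sqrt(2.0 * np.pi)

def kde(sample, grid, h):
    # Rosenblatt estimate of the form (11): p^(N)(y) = (1/(N h)) sum_i K((y - y_i)/h)
    return K((grid[:, None] - sample[None, :]) / h).sum(axis=1) / (len(sample) * h)

def renyi_entropy_plugin(sample, alpha, h, grid):
    # plug-in estimate: R_a^(N) = 1/(1 - a) * ln( integral of p^(N)(y)^a dy )
    p = kde(sample, grid, h)
    dy = grid[1] - grid[0]
    return np.log(np.sum(p**alpha) * dy) / (1.0 - alpha)

N = 2000
sample = rng.normal(size=N)        # observations of a N(0, 1) variable
h_N = N ** (-1.0 / 5.0)            # a conventional vanishing bandwidth choice
grid = np.linspace(-8.0, 8.0, 2001)

R2_hat = renyi_entropy_plugin(sample, 2.0, h_N, grid)
print(R2_hat)   # the exact quadratic entropy of N(0, 1) is ln(2*sqrt(pi)) ≈ 1.266
```

Note that only a logarithm of an integral of p^(N) raised to a power is needed, which is exactly the computational convenience of the Rényi entropy noted earlier.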

    p_{DR,S}^{(N)}(y_{DR}(t), y_S(t; \theta)) = \frac{1}{N h_N^2} \sum_{i=1}^{N} K_1 \left( \frac{y_{DR}(t) - y_{DR}^{(i)}}{h_N} \right) K_2 \left( \frac{y_S(t; \theta) - \theta^T \varphi^{(i)}}{h_N} \right).    (13)

In expressions (8)-(13), {h_N} is a vanishing sequence of positive numbers; in expressions (11)-(13), K_j(·), j = 1, 2, are positive bounded kernels on R¹. Under the assumption that (y_DR(t), y_S(t; θ)) meets the strong mixing condition (Rosenblatt, 1956b), and under corresponding integrability conditions imposed on the kernels K_j(·), j = 1, 2, and on the densities p_DR(y_DR(t)), p_S(y_S(t; θ)), p_{DR,S}(y_DR(t), y_S(t; θ)) (formulae (3)-(7) of Mokkadem (1989)), estimate (9) converges in the mean-square sense. Meanwhile, it should be noted that expressions (8) to (10) are considerably simpler than those of Mokkadem (1989), since expressions (8) to (10) do not require a limit transfer of the form

    h^{-1} \ln \left( \int (p(x))^{h+1} \, dx \right) \to \int p(x) \ln(p(x)) \, dx \quad \text{as } h \to 0.

Thus, finally, the required performance index f̃(θ) of an information-theoretic type, subject to maximization over θ, is formed by subsequent substitution of expressions (8)-(13) into expression (7). Of course, from an analytical point of view, expression (7) for the function f̃(θ) looks rather complex, and the function may have several local maxima; a natural way to solve such an optimization problem is to apply genetic algorithms, e.g. (Baeck et al., 1997; Schaefer, 2007), which are an efficient tool for numerical function optimization.

4. CONCLUSIONS

In the paper, an approach to machine learning has been presented. Within the approach, the key issue is a proper handling of the inherent dependence between the variables. Using a consistent measure of stochastic dependence of random processes, the Rényi mutual information, has been proposed within the identification scheme, and the paper presents an approach to machine learning in accordance with the information-theoretic criterion derived. Meanwhile, the parameterized description is utilized, combined with a corresponding technique of estimation of the mutual information, leading, finally, to a finite-dimensional optimization problem. A solution to the latter may be obtained by applying an appropriate technique of multivariate function optimization.

REFERENCES

Baeck, T., D.B. Fogel and Z. Michalewicz (Eds.) (1997). Handbook of Evolutionary Computation. Institute of Physics Publishing, Bristol, 988 p.
Basseville, M. (2013). "Divergence measures for statistical data processing – An annotated bibliography", Signal Processing, vol. 93, no. 4, pp. 621-633.
Bonev, B., Escolano, F., Giorgi, D., and S. Biasotti (2013). "Information-theoretic selection of high-dimensional spectral features for structural recognition", Computer Vision and Image Understanding, vol. 117, no. 3, pp. 214-228.
Garbarine, E., De Pasquale, J., Gadia, V., Polikar, R., and G. Rosen (2011). "Information-theoretic approaches to SVM feature selection for metagenome read classification", Computational Biology and Chemistry, vol. 35, no. 3, pp. 199-209.
Kamimura, R. (2010). "Information-theoretic enhancement learning and its application to visualization of self-organizing maps", Neurocomputing, vol. 73, no. 13-15, pp. 2642-2664.
Kotz, S., Balakrishnan, N., and N.L. Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications, Second Edition, Wiley, New York, 752 p.
Mokkadem, A. (1989). "Estimation of the entropy and information of absolutely continuous random variables", IEEE Transactions on Information Theory, vol. IT-35, pp. 193-196.
Nadarajah, S. and K. Zografos (2003). "Formulas for Rényi information and related measures for univariate distributions", Information Sciences, vol. 155, no. 1, pp. 119-138.
Nadarajah, S. and K. Zografos (2005). "Expressions for Rényi and Shannon entropies for bivariate distributions", Information Sciences, vol. 170, no. 2-4, pp. 173-189.
Porto-Díaz, I., Bolón-Canedo, V., Alonso-Betanzos, A., and O. Fontenla-Romero (2011). "A study of performance on microarray data sets for a classifier based on information theoretic learning", Neural Networks, vol. 24, no. 8, pp. 888-896.
Principe, J.C. (2010). Information Theoretic Learning, Springer, New York.
Rényi, A. (1959). "On measures of dependence", Acta Math. Hung., vol. 10, no. 3-4, pp. 441-451.
Rényi, A. (1960). "On measures of information and entropy", in: Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960, pp. 547-561.
Rényi, A. (1976a). "Some fundamental questions of information theory", in: Selected Papers of Alfréd Rényi, Akademiai Kiado, Budapest, vol. 2, pp. 526-552.
Rényi, A. (1976b). "On measures of entropy and information", in: Selected Papers of Alfréd Rényi, Akademiai Kiado, Budapest, vol. 2, pp. 565-580.
Rosenblatt, M. (1956a). "Remarks on some nonparametric estimates of a density function", Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832-837.
Rosenblatt, M. (1956b). "A central limit theorem and a strong mixing condition", Proc. Nat. Acad. Sci. U.S.A., vol. 42, pp. 43-47.
Sarmanov, O.V. (1963). "The maximal correlation coefficient (nonsymmetric case)", Sel. Transl. Math. Statist. Probability, vol. 4, pp. 207-210.
Sarmanov, O.V. and E.K. Zakharov (1960). "Measures of dependence between random variables and spectra of stochastic kernels and matrices", Matematicheskiy Sbornik, vol. 52(94), pp. 953-990 (in Russian).
Schaefer, R. (2007). Foundations of Global Genetic Optimization, Springer, Berlin, Heidelberg, 222 p.
Singh, A. and J.C. Principe (2011). "Information theoretic learning with adaptive kernels", Signal Processing, vol. 91, no. 2, pp. 203-213.
Vapnik, V. (1998). Statistical Learning Theory, John Wiley & Sons, New York.
Zografos, K. and S. Nadarajah (2005). "Expressions for Rényi and Shannon entropies for multivariate distributions", Statistics & Probability Letters, vol. 71, no. 1, pp. 71-84.