A Robust Loss Function for Classification with Imbalanced Datasets

Yidan Wang, Liming Yang

Communicated by Jun Yu

PII: S0925-2312(18)31352-3
DOI: https://doi.org/10.1016/j.neucom.2018.11.024
Reference: NEUCOM 20157

To appear in: Neurocomputing

Received date: 23 July 2018
Revised date: 28 October 2018
Accepted date: 10 November 2018

Please cite this article as: Yidan Wang, Liming Yang, A Robust Loss Function for Classification with Imbalanced Datasets, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.11.024

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• A new robust loss function is designed for imbalanced data sets.
• The proposed model combines this loss with SVM.
• The robustness of the model is analyzed in theory.
• The Bayes optimal solution is derived.
• An alternative iterative algorithm is designed to reduce algorithm complexity.

A Robust Loss Function for Classification with Imbalanced Datasets

Yidan Wang, Liming Yang∗

College of Science, China Agricultural University, 100083, Beijing, Haidian, China

Abstract

Based on minimizing misclassification cost, a new robust loss function is designed in this paper to deal with the imbalanced classification problem in noisy environments. It is nonconvex but maintains Fisher consistency. Applying the proposed loss function to the support vector machine (SVM) yields a robust SVM framework that results in a Bayes optimal classifier. However, nonconvexity makes the model difficult to optimize, so we develop an alternating iterative algorithm to solve the proposed model. Moreover, we analyze the robustness of the proposed model theoretically from a re-weighted SVM viewpoint, and the obtained optimal solution is consistent with the Bayes optimal decision rule. Furthermore, numerical experiments are carried out on databases drawn from the UCI Machine Learning Repository and on a practical application. Under two different types of noise, label noise and feature noise, the experimental results show that on these databases the proposed method achieves better generalization than other SVM methods.

Keywords: Robustness; Imbalanced Classification; Bayes optimal classifier; Fisher consistency; Half quadratic optimization

∗ Corresponding author. Email address: [email protected] (Liming Yang)

1. Introduction

In classification, learning from imbalanced data is an important branch that many researchers have studied [10, 36, 35]. It arises throughout practical applications such as face recognition, financial risk decision, crime analysis and medical diagnosis. Moreover, with the continuous progress of computer technology, the imbalance issue is becoming more and more significant, and it appears in many research fields such as deep learning [29, 28, 31], convolutional neural networks (CNNs) [20, 18] and computer vision [8, 7, 23]. For example, in [29], Yu et al. dealt with the problem of recognizing privacy images among a great number of shared images; in this task, the number of images that are private for users is relatively small. Hong et al. [8] faced the problem of recovering human pose from monocular videos, which is also an imbalanced issue. When dealing with such problems, traditional classifiers such as decision trees [9], support vector machines [1, 21], Bayes networks [4] and CNNs [20, 18] may not perform well.

In the literature, there are two main branches for solving this kind of problem: one manipulates the data sets to make them balanced, for example by over-sampling or under-sampling [13, 30, 1]; the other improves existing algorithms so as to shift the decision hyperplane away from the minority class, for example the two-phase induction method [27] and cost sensitive learning methods [25, 32, 22, 3], where the latter is most relevant to this article.

Among these methods, if a larger error penalty is given to the minority class and the misclassification cost is minimized in the sense of probability rather than by counting misclassified points, the model may derive the optimal solution more effectively. Because of this advantage, cost sensitive models that increase the misclassification cost [12] of the sample points in the minority class are especially suitable and are thus widely used for the imbalance issue. Using this idea, Zhang [33] further explained why classification algorithms such as SVM and its variants, logistic regression and boosting can perform quite well, and claimed that all of these methods satisfy the Bayes optimal decision rule, which gives researchers good enlightenment. Besides, Lin [14] pointed out that margin-based loss functions that are Fisher consistent under general conditions also satisfy the Bayes optimal decision rule. In 2008, Masnadi-Shirazi and Vasconcelos [16] put forward the SavageBoost algorithm and claimed that a classifier satisfying the Bayes optimal decision rule by minimizing the conditional risk might be an alternative to minimizing the empirical risk. A few years later, a cost sensitive support vector machine model was put forward [17]. Different from the model established in [34], where cost matrices were introduced, this model was based on a margin loss and had more statistical significance. They claimed that this Bayes classifier aims at obtaining the best classification hyperplane by minimizing the misclassification cost rather than the misclassification rate. Another comprehensible explanation of this kind of system is that, if the data sets are imbalanced, it is not fair to assign the same misclassification cost to both classes.

However, all the classification models mentioned above may not perform well when the data sets contain uncertain noise. Although the cost sensitive model reduces the overall conditional risk, the effect of noise remains out of control. There are many methods that make a model capable of dealing with noisy data sets; for example, researchers have introduced an uncertainty distribution on the data to develop a robust support vector machine [19], or established a bi-level cost sensitive model for the imbalanced classification problem [11]. The convolutional neural network [20, 18] is also a robust method: the neurons in a convolutional layer are connected to a small area of the input samples (local receptive fields), and this structure can tolerate input samples containing noise. Moreover, compared with traditional neural networks, CNNs use small-area convolution operations and share weights in the convolution layer, which reduces the number of network parameters to be trained and the complexity of the model. However, a CNN needs a large number of training samples in practice and will inevitably exhibit over-fitting. For imbalanced classification problems, the number of positive (minority) samples is less than that of negative (majority) samples, so this class may not be adequately learned by a CNN. In 2017, a rescaled hinge loss [24] was put forward by combining the hinge loss function with correntropy [15], which originated in information theory, and it was shown to be an effective approach for handling noisy environments.

In this paper, motivated by this robust technology, a new robust loss function is introduced for imbalanced data sets. Different from the cost sensitive learning methods mentioned above, the proposed model is more appropriate for imbalanced data sets when noise exists in the training set. The main contributions of this paper are summarized as follows:

First, inspired by the connection between the loss function and the Bayes risk, a new robust non-convex loss based on the rescaled hinge loss is designed for the imbalance issue. The designed loss minimizes the conditional risk, and we verify its Fisher consistency in theory.

Second, we apply the new loss in the SVM framework to obtain a robust classifier, and we analyze the robustness of this model theoretically from a re-weighted SVM viewpoint. It turns out that the proposed method is significant and effective in applications.

Third, we design an alternating iterative algorithm to deal with the non-convexity and reduce the computational complexity.

In addition, comparative experiments with the rescaled hinge loss support vector machine (RHSVM for short) [24] and the cost sensitive support vector machine (CSSVM for short) [17] are carried out in environments with different types of noise: no noise, 10 percent label noise and 10 percent feature noise. The results indicate that the proposed model improves the accuracy on the positive samples as well as the overall accuracy, and they also show its robustness to both label noise and feature noise.

The remainder of this paper is organized as follows. In Section 2, we give a brief view of the cost sensitive SVM model. A new robust loss function for the imbalanced issue, based on the rescaled hinge loss, is derived in Section 3.1, where its Fisher consistency is analyzed in theory. In Section 3.2, some important properties and analyses are given. In Section 3.3, a robust cost sensitive SVM model with the designed loss is established and the corresponding alternating algorithm is presented; moreover, the robustness is further analyzed from a re-weighted viewpoint. In Section 4, numerical experiments are given to illustrate the validity of the proposed model, and conclusions are given in Section 5.

2. Cost Sensitive SVM

In this section, we briefly introduce the Fisher consistency rule and the cost sensitive SVM, which is consistent with the Bayes decision rule. For details, please refer to [17], [16] and [14].

The goal of the classifier is to map a feature vector x ∈ Rn×1 in the feature space to a class label y ∈ {−1, 1}, where −1 and 1 denote the negative and positive class respectively. Assume the feature vector and the class label are drawn from probability distributions. p(x) is usually regarded as the posterior probability that a classifier predicts the feature vector x to be in the positive class; then 1 − p(x) is the posterior probability of predicting the negative class. Thus, if p(x) > 1 − p(x), the sample point is classified into the positive class, and otherwise into the negative class. Therefore, h(x) = sign(2p(x) − 1) accurately predicts every feature vector over all samples. Denoting by L(yf(·)) a loss function introduced into the classification model to measure the performance of the classifier, where f(·) is a prediction function, we have the following definition:

Definition 2.1. The global expected loss function, known as the conditional risk, is

CL(p(x), f(·)) = p(x)L(f(·)) + (1 − p(x))L(−f(·)).    (1)

An optimal classifier is the minimizer of (1):

f∗ = arg minf CL(p(x), f(·)).    (2)

If sign(f∗) = h(x), the loss function L is Fisher consistent and the classifier f∗ is consistent with the Bayes decision rule.
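As a quick numerical illustration of Definition 2.1 (added here for clarity; the hinge loss and the search grid below are our own choices, not part of the original text), the following sketch evaluates the conditional risk (1) on a grid of candidate values of f and checks that the sign of the numerical minimizer agrees with sign(2p(x) − 1):

```python
import numpy as np

def conditional_risk(p, f, loss):
    # Definition 2.1: C_L(p, f) = p*L(f) + (1 - p)*L(-f)
    return p * loss(f) + (1.0 - p) * loss(-f)

hinge = lambda z: np.maximum(0.0, 1.0 - z)   # example margin loss, assumed here
grid = np.linspace(-3.0, 3.0, 601)           # candidate values of f

for p in (0.2, 0.6, 0.9):
    f_star = grid[np.argmin(conditional_risk(p, grid, hinge))]
    print(p, np.sign(f_star) == np.sign(2.0 * p - 1.0))   # prints True for each p
```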

Theorem 2.1. ([14]) Let L be a loss function satisfying the assumptions: 1. L(z) < L(−z), ∀z > 0; 2. L′(0) ≠ 0 exists; where z = yf(·) denotes the margin. If CL(p(x), f(·)) has a global minimizer f∗, then sign(f∗) = h(x), and the loss function L is Fisher consistent.

Proof. According to (1), the global expected loss function has the following form: CL(p(x), f(·)) = p(x)L(f(·)) + (1 − p(x))L(−f(·)). Its first-order derivative is CL′(p(x), f(·)) = p(x)L′(f(·)) − (1 − p(x))L′(−f(·)). Assume the minimizer of (1) is f̂, that is, f̂ = arg minf CL(p(x), f(·)). Next, we discuss the cases f = 0, f > 0 and f < 0 respectively.

• If f = 0, then CL′(p(x), 0) = p(x)L′(0) − (1 − p(x))L′(0) = (2p(x) − 1)L′(0). According to assumption 2, L′(0) ≠ 0, thus CL′(p(x), 0) ≠ 0. Therefore, f = 0 is not the minimizer of (1).

• If f > 0, since f̂ is the minimizer of (1), CL(p(x), f̂(·)) − CL(p(x), −f̂(·)) < 0, and

CL(p(x), f̂(·)) − CL(p(x), −f̂(·)) = p(x)L(f̂(·)) + (1 − p(x))L(−f̂(·)) − p(x)L(−f̂(·)) − (1 − p(x))L(f̂(·)) = (2p(x) − 1)(L(f̂(·)) − L(−f̂(·))) < 0.    (3)

According to assumption 1, L(f̂(·)) − L(−f̂(·)) < 0. Thus 2p(x) − 1 > 0, that is, p(x) > 1/2 when f > 0.

• If f < 0, according to assumption 1 and (3), L(f̂(·)) − L(−f̂(·)) > 0. Thus p(x) < 1/2 when f < 0.

In summary, f̂ is the global minimizer of (1) and sign(f̂) = sign(2p(x) − 1). According to Definition 2.1, the loss function L(z) is Fisher consistent.

Let the cost of a false negative sample be C1 and the cost of a false positive sample be C−1, and write f = f(·) and (x)+ = max(x, 0). The cost sensitive loss function is

φC1,C−1(yf) = { C1(1 − yf)+,            y = 1;
                (1 − (2C−1 − 1)yf)+,    y = −1.    (4)

Let kk = 1/(2C−1 − 1). The corresponding cost sensitive SVM model (CSSVM for short) searches for the optimal prediction function f(x) = wT x + b that minimizes the global expected loss function (defined in Definition 2.1), where w ∈ Rn×1 and b ∈ R are obtained as in [17], and the established model has the following form:

min_{w,b}   (1/2)‖w‖² + C·C1 Σ_{i|yi=1} ξi+ + (C/kk) Σ_{i|yi=−1} ξi−
s. t.   (w · xi) + b ≥ 1 − ξi+,     yi = 1                    (5)
        (w · xi) + b ≤ −kk + ξi−,   yi = −1
        ξi+, ξi− ≥ 0.
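To make model (5) concrete, the following is a minimal sketch of the CSSVM quadratic program for a linear decision function f(x) = wT x + b, written with the cvxpy modelling package; the package choice, the function name cssvm_fit and the default parameter values are assumptions made only for illustration, not part of the original paper.

```python
import numpy as np
import cvxpy as cp

def cssvm_fit(X, y, C=1.0, C1=1.5, kk=0.5):
    """Minimal sketch of the CSSVM problem (5) with a linear decision function."""
    n, d = X.shape
    pos, neg = (y == 1), (y == -1)
    w, b = cp.Variable(d), cp.Variable()
    xi_p = cp.Variable(pos.sum(), nonneg=True)   # slacks for positive samples
    xi_n = cp.Variable(neg.sum(), nonneg=True)   # slacks for negative samples
    obj = 0.5 * cp.sum_squares(w) + C * C1 * cp.sum(xi_p) + (C / kk) * cp.sum(xi_n)
    cons = [X[pos] @ w + b >= 1 - xi_p,          # margin 1 for the positive class
            X[neg] @ w + b <= -kk + xi_n]        # margin kk for the negative class
    cp.Problem(cp.Minimize(obj), cons).solve()
    return w.value, b.value
```

The two slack terms are weighted by C·C1 and C/kk, and the negative-class margin is kk, exactly as in (5).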

3. Robust cost sensitive support vector machine

In many practical cases the binary classification problem is ubiquitous, for example in face recognition, financial risk decision, crime analysis and medical diagnosis. However, the two target data sets are likely to be imbalanced. In this situation, a classifier induced by a robust cost sensitive loss function may be more suitable to handle such cases. In the following we design a new robust loss function. It is worth noting that the new loss function is robust to noise and is especially effective for the imbalanced issue; therefore, it is of great significance.


3.1. Robust Cost Sensitive Loss Function

In this subsection, a robust cost sensitive loss function is designed for the imbalanced classification problem. We first introduce the rescaled hinge loss into the global expected loss function to derive the global minimizer f∗. Then, for imbalanced cases, we extend the global minimizer to a cost sensitive form. Finally, by adopting the technique shown in [17], we derive the robust cost sensitive loss function.

The rescaled hinge loss function was put forward by Xu et al. in [24]. Its functional form is as follows:

Lr(yf) = β(1 − exp(−η(1 − yf)+)),    (6)

where (x)+ = max(0, x), β = 1/(1 − exp(−η)) and η > 0 is a parameter that needs to be adjusted. As shown in Figure 1, this loss becomes flat when yf decreases sharply. Therefore, it is insensitive to large values of 1 − yf and can be used to handle the interference of outliers on the model.
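For reference, a direct NumPy implementation of the rescaled hinge loss (6) can be sketched as follows (the function and argument names are ours):

```python
import numpy as np

def rescaled_hinge_loss(margin, eta=0.5):
    # Lr(yf) = beta * (1 - exp(-eta * [1 - yf]_+)), with beta = 1 / (1 - exp(-eta))
    beta = 1.0 / (1.0 - np.exp(-eta))
    hinge = np.maximum(0.0, 1.0 - margin)        # [1 - yf]_+
    return beta * (1.0 - np.exp(-eta * hinge))   # bounded above by beta
```

Because the loss saturates at β, a single outlier with a very negative margin contributes at most β to the objective, which is the source of its robustness.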

First, we give the following analysis by applying Definition 2.1. Let p = p(x); the conditional risk with the rescaled hinge loss is

C(p, f) = p Lr(f) + (1 − p) Lr(−f)
        = pβ(1 − exp(−η[1 − f]+)) + (1 − p)β(1 − exp(−η[1 + f]+)).    (7)

The global minimizer of C(p, f) is

f∗ = sign(2p − 1).    (8)

However, the rescaled hinge loss gives the same penalty to false negative samples as to false positive samples. In practical applications, binary classification tends to face imbalanced datasets, because the collected sample points are commonly imbalanced. Based on Bayes decision theory, it becomes particularly necessary to design a classifier with different misclassification costs for false negative and false positive samples. In the following, we extend the rescaled hinge loss to a cost sensitive form to improve the performance when dealing with noisy and imbalanced data. First, we extend (8) into a cost sensitive form according to [33]:

f∗ = sign((C1 + C−1)p − C−1),    (9)

where C1 and C−1 are the costs of a false negative and a false positive respectively. Replacing f in the conditional risk function with the above form, the minimum conditional risk has the following form:

C∗(p, f∗) = pβ(1 − exp(−η(1 − sign((C1 + C−1)p − C−1))+)) + (1 − p)β(1 − exp(−η(1 + sign((C1 + C−1)p − C−1))+))
          = { pβ(1 − exp(−2η)),        p ≤ C−1/(C−1 + C1);
              (1 − p)β(1 − exp(−2η)),  p > C−1/(C−1 + C1).    (10)

Next, we introduce degrees-of-freedom parameters a, b, e, d into (10) to make the function more flexible while maintaining the form of (10). These free parameters are of great significance for implementers to obtain a better solution. Thus, C∗(p, f∗) can be extended to the following form:

C∗(p, f∗) = pβ(1 − exp(−η[e − d · sign((C1 + C−1)p − C−1)]+)) + (1 − p)β(1 − exp(−η[b + a · sign((C1 + C−1)p − C−1)]+))
          = { pβ(1 − exp(−η(e + d))),        p ≤ C−1/(C−1 + C1);
              (1 − p)β(1 − exp(−η(a + b))),  p > C−1/(C−1 + C1),    (11)

where a, b, e, d satisfy

a ≥ b;   d ≥ e;   C−1/C1 = (1 − exp(−η(a + b)))/(1 − exp(−η(d + e))).    (12)

For p > C−1/(C−1 + C1), sign((C1 + C−1)p − C−1) = 1. To agree with C∗(p, f∗) = (1 − p)β(1 − exp(−η(a + b))) when p > C−1/(C−1 + C1), the term pβ(1 − exp(−η[e − d · sign((C1 + C−1)p − C−1)]+)) = 0 must hold; therefore d ≥ e. We can prove a ≥ b by the same argument. Also, since the cost of a false negative sample is C1 and the cost of a false positive sample is C−1, the third equation in (12) holds. In (11), p > C−1/(C−1 + C1) means that the sample should be assigned to the positive class and the loss is (1 − p)β(1 − exp(−η(a + b))); p ≤ C−1/(C−1 + C1) means that the sample should be assigned to the negative class and the corresponding loss is pβ(1 − exp(−η(e + d))). The conditions d ≥ e and a ≥ b ensure that correctly classified samples incur no loss. The condition C−1/C1 = (1 − exp(−η(a + b)))/(1 − exp(−η(d + e))) ensures that C∗ is a concave function of p, which is very realistic in the sense of probability: the closer p is to 1/2, the harder the sample is to classify and the greater the loss when it is classified.

Without loss of generality, let e = d = C1 and b = 1; then a = −1 − (1/η)ln(1 − C−1/C1 + (C−1/C1)exp(−2ηC1)) > 0. The robust cost sensitive loss is

LrC1,C−1(yf) = { β(1 − exp(−ηC1[1 − yf]+)),   y = 1;
                 β(1 − exp(−η[1 − ayf]+)),    y = −1,    (13)

which can be seen in Figure 1. Figure 1 also clarifies that the margin to the negative-class hyperplane is much smaller than that on the other side. This may reduce, to a certain degree, the number of negative-class points falling inside the margin, and thus the number of penalized sample points in the negative class. This is effective for the imbalanced binary classification problem in which the negative class has many more points than the positive class.
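A small NumPy sketch of the robust cost sensitive loss (13), with the parameter a computed from the choice e = d = C1, b = 1, may help implementers; the function name, argument names and default values below are ours, not the paper's.

```python
import numpy as np

def robust_cs_loss(margin, y, C1=1.5, C_neg=1.0, eta=0.5):
    """Loss (13); margin = y*f(x), C1/C_neg are the false-negative/false-positive costs."""
    beta = 1.0 / (1.0 - np.exp(-eta))
    r = C_neg / C1
    a = -1.0 - np.log(1.0 - r + r * np.exp(-2.0 * eta * C1)) / eta   # a from Section 3.1
    loss_pos = beta * (1.0 - np.exp(-eta * C1 * np.maximum(0.0, 1.0 - margin)))
    loss_neg = beta * (1.0 - np.exp(-eta * np.maximum(0.0, 1.0 - a * margin)))
    return np.where(y == 1, loss_pos, loss_neg)
```

With C−1 = 1 and C1 = 1.5, for example, a ≈ 0.46 < 1, so the negative-class branch penalizes margin violations over a narrower band, which is what Figure 1 shows.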

3.2. Properties and Analyses

In this subsection, we analyze the properties of the minimum conditional risk C∗(p, f∗), the Fisher consistency, and the robustness of the proposed robust cost sensitive loss. First, we analyze the properties of the minimum conditional risk C∗(p, f∗):

ACCEPTED MANUSCRIPT

4 robust cost sensitive loss with y=1 robust cost sensitive loss with y=−1 0−1 loss hinge loss

3.5

3

2.5

1

0.5

0 −3

−2

−1

0 yf

1

2

CR IP T

2

1.5

3

Figure 1: robust cost sensitive loss LrC1 ,C−1 (yf ) with y = 1 and y = −1 are shown by blue

AN US

and green lines, hinge loss and 0-1 loss are shown by bluish green and red lines where η = 0.5.

3. C ∗ (p, f ∗ ) reaches to the maximum at p∗ =

C−1 C−1 +C1 ;

4. If C ∗ (p, f ∗ ) satisfies property 1 and 2, then, it can necessarily has the property 3. Proof.

p≤ p>

C−1 C−1 +C1 ; C−1 C−1 +C1 .

ED

M

  pβ(1 − exp(−η(e + d))), C ∗ (p, f ∗ ) =  (1 − p)β(1 − exp(−η(a + b))),

(14)

From the above equation, we can see that C ∗ (p, f ∗ ) is concave with respect to p. Then, the property 1 holds. What’s more, let pβ(1 − exp(−η(e + d))) =

PT

((1 − p)β(1 − exp(−η(a + b))), according to (12), we have p =

C−1 C−1 +C1 ,

thus, the

property 3 holds.

For all δ ∈ [0, C−11+C1 ], p∗ −C−1 δ ∈ [0, p∗ ], thus, C ∗ (p∗ −C−1 δ) =

C−1 C−1 +C1 (1−

CE

δ)β(1 − exp(−η(e + d))); p∗ − C1 δ ∈ [p∗ , 1], thus, C ∗ (p∗ + C1 δ) = (1 − (p∗ +

C1 δ))β(1 − exp(−η(a + b))) =

C1 C−1 +C1 (1

− δ)β(1 − exp(−η(a + b))). Again, by

AC

adjusting (12), we derive property 2. According to property 2, for all δ ∈ [0, C−11+C1 ], we have C ∗ (p∗ − C−1 δ) =

C ∗ (p∗ + C1 δ), taking derivatives on both sides when δ = 0, we have 0

C∗ (

0 C−1 C−1 )(−C−1 ) = C ∗ ( )C1 C−1 + C1 C−1 + C1 0

(15)

C−1 ∗ ∗ The equation holds when C ∗ ( C−1 +C1 ) = 0, according to property 1, C (p, f )

12

ACCEPTED MANUSCRIPT

is a concave function on p, thus, p∗ =

C−1 C−1 +C1

is the maximum of C ∗ (p, f ∗ ).

Therefore, property 4 holds.

CR IP T

Next, we verify the Fisher consistency of the robust cost sensitive loss. Theorem 3.2. The proposed loss function LrC1 ,C−1 (yf ) is Fisher consistent and the constructed classifier induced by this loss is Bayes consistent.

Proof. Assume z = yf , then, by applying Theorem 2.1, we just verify the following formula:

AN US

LrC1 ,C−1 (−z) − LrC1 ,C−1 (z)   β(exp(−ηC [1 − z] ) − exp(−ηC [1 + z] )), y = 1; 1 + 1 + =  β(exp(−η[1 − az] ) − exp(−η[1 + az] )), y = −1. + +

(16)

0 For all z > 0, LrC1 ,C−1 (−z) − LrC1 ,C−1 (z) < 0 holds. Besides, LrC (0) 6= 0 1 ,C−1

can be easily obtained. What’s more, when C−1 = C1 , (9) and (8) are equal.

M

Thus, the robust cost sensitive loss function satisfies the Fisher consistency and the classifier induced by this robust cost sensitive loss function is Bayes

ED

consistent.

3.3. Robust Cost Sensitive SVM Model In order to improve the performance when classifying the imbalanced data

PT

sets in the noise environment, we model the robust cost sensitive SVM model with the derived loss function and give the iterative algorithm. First, put the

CE

loss function into the support vector machine framework and drop off constant

AC

items, we have: X 1 max − kwk2 + Cβ exp(−ηC1 [1 − fi ]+ ) w,b 2 + Cβ

X

{i|yi =1}

(17)

exp(−η[1 + afi ]+ ).

{i|yi =−1}

where fi = wT xi + b represents the predicted value of i-th sample points. However, the model is difficult to solve directly for its non-convexity. To simplify

13

ACCEPTED MANUSCRIPT

the model, we make the following analyses. According to the duality theory, by introducing a convex function g(v) = −v · log(−v) + v (v < 0), we have: exp(−ηC1 [1 − fi ]+ ) = sup ηC1 [1 − fi ]+ · v1 − g(v1 ) v1 <0

CR IP T

(18)

exp(−η[1 + afi ]+ ) = sup η[1 + afi ]+ · v2 − g(v2 ) v2 <0

Detailed derivation process of the above formula, please refer to [24]. By ap-

plying this technology, the final model with the proposed robust loss can be transformed into the following form:

X 1 ηC1 [1 − fi ]+ · v1 − g(v1 )} − kwk2 + Cβ{ 2 X

+ Cβ{

AN US

max

w,b,v1 ,v2 <0

{i|yi =1}

{i|yi =−1}

(19)

η[1 + afi ]+ · v2 − g(v2 )}.

(k)

where (w, b) and (v1 , v2 ) are alternatively optimized. For given value v1 and

(k) v2

<0

< 0, the model is equivalent to solving the following model to derive

min

1 (k) kwk2 − CβηC1 v1 2

ED

w,b

M

optimal (w(k) , b(k) ):

s. t. (w · xi ) + b ≥ 1 −

X

(k)

{i|yi =1}

ξi+ − Cβηv2

ξi+ ;

X

ξi−

{i|yi =−1}

yi = 1

PT

1 1 (w · xi ) + b ≤ − + ξi− ; a a

(20)

yi = −1

ξi+ , ξi− ≥ 0. (k+1)

CE

When (w(k) , b(k) ) is obtained, we update (v1

(k+1)

, v2

) via half quadratic opti-

mization algorithm which can be related to [6]. The optimal solution is reached

AC

at

(k+1)

= −exp(−ηC1 [1 − f (w(k) , b(k) )]+ ),

(k+1)

= −exp(−η[1 + af (w(k) , b(k) )]+ ).

v1

v2

(21)

(k)

(k)

The alternate iterative algorithm flow can be simply expressed as: (v1 , v2 ) → (k)

(k+1)

(w(k) , b(k) , ξi ) → (v1

(k+1)

, v2

). Detailed algorithm steps can be seen in

Algorithm 1 and the flow chart of the Algorithm 1 is shown in Figure 2. 14

ACCEPTED MANUSCRIPT

Algorithm 1 Iterative Algorithm Input: Sample space X, penalty parameter C, the cost of positive and negative class C1 and C−1 and the maximum total iteration number K; 1: 2:

(0)

(0)

Initial point v1 := −1 and v2 := −1 and set k=0;

CR IP T

Output: Optimal solution (w, b). While k ≤ K and kw(k) − w(k−1) k ≤  where  is a very small threshold, do the model (19); 3:

Obtain (w(k) , b(k) );

4:

Update (v1

5:

Set k=k+1;

6:

Return the final solution (w, b).

(k+1)

, v2

) by introducing (w(k) , b(k) ) into (21);

AN US

(k+1)

(0)

Remark: In the first iteration, Set v1

(0)

:= −1 and v2

:= −1, this model

is equivalent to a cost sensitive SVM model. That is, the first iteration solves a cost sensitive SVM model.

M

In the next, we analyze the convergence of Algorithm 1. Fortunately, from the following Theorem 3.3, Algorithm 1 is convergent.

ED

Theorem 3.3. Denoted the objective function value of (20) as A(w, v1 , v2 ). The sequence {A(w, v1 , v2 ), k = 1, 2, · · · , K} generated by Algorithm 1 is convergent.

PT

Proof. Suppose w(k) is obtained by (20), that is, (k−1)

A(w(k) , v1

(k−1)

, v2

(k−1)

) ≤ A(w(k−1) , v1

(k−1)

, v2

),

CE

With the w(k) , we minimize the (20) by Algorithm 1 to obtain the optimal (k)

(k)

AC

(v1 , v2 ), then (k)

(k)

(k−1)

A(w(k) , v1 , v2 ) ≤ A(w(k) , v1

(k−1)

, v2

).

The above two simultaneous equations can result in (k)

(k−1)

(k)

A(w(k) , v1 , v2 ) ≤ A(w(k−1) , v1 (k)

(k)

(k−1)

, v2

).

In addition, A(w(k) , v1 , v2 ) ≥ 0 ensures Algorithm 1 to be convergent. 15

CR IP T

ACCEPTED MANUSCRIPT

AN US

Figure 2: The basic framework of the proposed method: First, preprocess the data: Processing

missing values, data normalization, regularization, processing missing values, etc. Second, parameter selection: C, C1 , C−1 and K. Third, begin Iterative Algorithm 1 to derive the optimal solution.

Theorem 3.4. Algorithm 1 can effectively improve the robustness to label

M

noises as well as feature noises. For the samples with large positive value of yf , the loss is fixed, which have no contributions to the optimization of the model. For the samples with negative value of yf , the weights in the loss decrease to

ED

very small value. For the samples with feature noise, the perturbations to the weights are under control.

PT

Proof. We analyze the robustness by decomposing in each iteration. Now, con(k)

sider the (k+1)-th iteration, let λ1i = Cβηexp(−ηC1 [1 − yi f (k) (xi )]+ ) and (k)

CE

λ2i = Cβηexp(−η[1 − ayi f (k) (xi )]+ ), this model is equivalent to a reweighted

AC

cost sensitive SVM model: min

w,b,v<0

s. t.

X (k) X (k) 1 kwk2 + λ1i ξi+ + λ2i ξi− 2 i|yi =1

(w · xi ) + b ≥ 1 − ξi+ ;

i|yi =−1

1 1 (w · xi ) + b ≤ − + ξi− ; a a

yi = 1

(22)

yi = −1

ξi+ , ξi− ≥ 0. From the above model, we can see that the weight of positive and neg16

ACCEPTED MANUSCRIPT

ative class loss is different. In (k + 1)-th iteration, the positive class loss is (k)

(k)

λ1i = Cβηexp(−ηC1 [1 − yi f (k) (xi )]+ ) and the negative class loss is λ2i =

Cβηexp(−η[1 − ayi f (k) (xi )]+ ). However, due to the symmetry of the two vari(k)

CR IP T

ables, we analyze the λ1i in the kth iteration here only, we have the following conclusions:

1. For the samples with very large positive number yi f (k) (xi ), by calculating, (k)

λ1i = Cβη. That is, the very right points are classified with fixed loss.

2. If there exists label noise in the training data set, then yi f (k) (xi ) is a (k)

negative number with a large absolute value, the weight λ1i decrease to a

AN US

very small value. This may cause robustness of our model for label noise.

3. If there exist feature noise in the training data set, then, the perturba(k)

tion of λ1i under some certain noise distribution caused by the function exp(−ηC1 [1 − yi f (k) (xi )]+ ) must not vary much. This may cause our

M

model insensitive to feature noise.

ED

4. Numerical Experiment

To illustrate the validity of our model established, we represent numerical experiments in this section. In the first experiment, we compare the cost sensitive

PT

SVM (CSSVM for short) [17] with SVM with the rescaled hinge loss(RHSVM for short) [24] and robust cost sensitive SVM (RCSSVM). In the second exper-

CE

iment, we compare the RCSSVM with the RHSVM under the same parameter selection. In the third experiment, we present the experimental results with

AC

label noise and feature noise. The so called 10-fold cross-validation is used to estimate the performance

of the algorithm to ensure the fairness. Evaluation criterions that used in this trial is as follows: Accuracy (ACC), Mathew Correlation Coefficient(MCC),

17

ACCEPTED MANUSCRIPT

F1 -measure (F1 ) and Recall (R), which are defined as [5]. TP + TN 2 × TP , F1 = TP + TN + FP + FN 2 × TP + FP + FN TP × TN − FP × FN M CC = p (T P + F P )(T P + F N )(T N + F P )(T N + F N ) TP R= . TP + FN where TP and TN denote true positives and true negatives, FN and FP denote false negatives and false positives, respectively.

CR IP T

ACC =

It is important to point out that Recall reflects the classification ability of model to identify the positive samples. The higher the Recall is, the stronger

AN US

recognition ability the model for positive samples is. F1-score is a combination of identifying both the positive and negative samples.

4.1. Comparisons Results between CSSVM, RHSVM and RCSSVM with UCI data sets

In this subsection, The used data sets are drawn from UCI Repository

M

for Machine Learning Databases [2].

We represent these data in Table 1.

We compare the RCSSVM(η = 0.2), RCSSVM(η = 0.5), RHSVM(η = 0.2),

ED

RHSVM(η = 0.5) and CSSVM algorithms. In the RCSSVM(η = 0.2 and η = 0.5) model, we can see that the distance from the classification hyperplane to the positive is 1. The distance from the classification hyperplane to the

PT

negative hyperplane is kk =

1 a

where a = −1 − η1 ln(1 − CC−1 + CC−1 exp(−2ηC1 )). 1 1

What’s more, the misclassification punishment is different from the false nega-

CE

tive to the false positive. For CSSVM model, we fix kk =

1 2C−1 −1

= 0.5 and the

parameter C1 is selected from 0.8 to 2.5 and the step length is 0.1, therefore, C1 has 18 candidates. For RHSVM and RCSSVM models, the parameter C−1 is

AC

fixed at C−1 = 1 and C1 is selected from 0.8 to 2.5 and the step length is 0.1. In

this text, the parameters C is selected from set {0.001, 0.01.0.1, 1, 10, 100, 1000}. Other fixed parameter is selected as follows: total iteration number is K = 15 as set in [24]. The experimental results are shown in Table 2, where the bold numbers represent the best results. The optimal parameters are represented in Table 7. 18

ACCEPTED MANUSCRIPT

Table 1: Data sets in the experiment

sample No.

feature No.

positive:negative

CMC

1140

9

1:1.23

german

1000

24

1:2.33

hepat

155

19

Ionosp

350

34

QSAR

1055

41

wholesale

440

7

CR IP T

Dataset

1:3.84 1:1.8

1:1.96 1:2.1

AN US

Table 2 shows that the RCSSVM performs better when measured by the

criterions ACC, M CC, and F 1, but low rate of R. In the CSSVM model, we fix kk =

1 2C−1 −1

= 0.5 and tune the parameter C1 from 0.8 to 2.5 every 0.1 steps.

In the RHSVM and RCSSVM models, we fix C−1 = 1 and tune the parameter C1 from 0.8 to 2.5 every 0.1 steps and tune the parameter η from 0.2 to 0.5. We

M

list the RHSVM( η = 0.2 and η = 0.5 ) and RCSSVM( η = 0.2 and η = 0.5 ). By tuning the parameter η of the propose RCSSVM, we can derive the optimal solution. In the Table 2, we give the results when the parameter η = 0.2 and

ED

η = 0.5. However, when comparing the performance among different models, we should aggregate the performance under different parameters (η = 0.2 and

PT

η = 0.5) and take the highest accuracy. Thus, for the performance of RCSSVM, the number of times with the highest precision will be accumulated. Therefore, the performance of RCSSVM is better than the rest two models.

CE

For R, the proposed model is indeed slightly inferior to the CSSVM. This

is because the final decision is to minimize the misclassification cost over all the positive and negative class. In this experiment, the negative class has more

AC

sample points than the positive class. The results of R that are shown in Table 2 may indicate that the cost of one false negative sample point cannot catch up with the cost of ten or more false negative sample points. The results also give us insight that we can enhance penalties to false negative to increase the rate of R. We can also see that, the CSSVM model has the highest rate of

19

ACCEPTED MANUSCRIPT

RHSVM

RHSVM

RCSSVM

RCSSVM

kk=0.5

η=0.2

η=0.5

η=0.2

η=0.5

CMC

64.60

63.89

64.16

63.89

64.87

german

76.90

76.10

76.30

76.80

76.80

hepat

83.33

80.00

79.33

82.67

84.00

Ionosp

88.24

87.94

87.35

88.24

87.94

QSAR

87.12

86.73

86.54

87.31

87.79

wholesale

90.93

90.23

91.63

91.40

91.63

CMC

30.64

28.60

29.52

28.60

33.31

german

42.34

37.99

40.18

44.77

42.53

hepat

45.83

8.75

35.46

37.69

41.91

Ionosp

71.06

70.49

69.23

71.18

70.91

QSAR

70.41

70.40

70.31

70.63

72.89

wholesale

81.75

77.99

81.43

81.40

81.43

CMC

67.22

65.56

66.46

65.56

69.93

german

63.25

56.83

60.87

67.40

63.86

hepat

63.16

6.40

55.91

51.46

55.17

Ionosp

81.48

81.27

80.48

81.84

82.32

QSAR

83.75

84.16

84.26

83.77

85.63

wholesale

90.86

88.43

90.38

90.47

90.38

CMC

63.53

63.26

63.27

63.26

62.84

german

80.71

81.37

80.33

79.18

80.30

hepat

85.71

80.00

78.79

86.27

88.89

Ionosp

97.47

96.86

96.21

96.89

95.22

QSAR

90.42

89.14

88.64

90.85

89.95

wholesale

91.01

92.15

93.06

92.44

93.06

ACC

MCC

PT

ED

F1

AC

CE

R

AN US

Datasets

CR IP T

CSSVM

M

Table 2: Comparisons results of different evaluation criterions

R. That is, without noise, the CSSVM performs well, but our proposed models (RCSSVM(η = 0.2 and η = 0.5)) has the highest rate of ACC. From a compre-

hensive view, if we compare all the criterions, our model is much better. This result may of great significance especially for imbalanced datasets.

20

CR IP T

ACCEPTED MANUSCRIPT

Table 3: Comparisons of evaluation criterions for RHSVM and RCSSVM with η=0.2

Spectral region −1

k

ACC

MCC

F1

R

AAC

MCC

F1

R

4000-10000

90.00

80.04

89.83

91.38

97.50

95.05

97.54

95.97

0.5

9000-10000

64.58

29.29

62.88

66.06

67.92

40.72

74.07

62.15

0.01

8000-9000

65.83

31.74

64.66

66.96

65.00

35.46

72.37

59.78

0.01

8000-10000

64.17

28.34

63.87

64.41

76.67

59.18

80.82

68.60

0.01

6000-8000

77.08

54.86

78.76

73.38

87.92

78.15

89.22

80.54

0.01

5000-8000

80.83

61.88

81.60

78.46

89.58

80.94

90.57

82.76

0.01

4000-6000

86.25

72.70

86.75

83.72

93.33

87.45

93.75

88.24

0.01

AN US

)

RCSSVM

M

(cm

RHSVM

Table 4: Comparisons of evaluation criterions for RHSVM and RCSSVM with η=0.5

−1

(cm

)

4000-10000

RCSSVM

k

ACC

MCC

F1

R

AAC

MCC

F1

R

89.58

79.17

89.54

89.92

97.50

95.05

97.54

95.97

0.5

62.92

26.24

59.36

65.66

68.75

42.19

74.58

62.86

0.01

PT

9000-10000

RHSVM

ED

Spectral region

64.58

29.62

61.19

67.68

66.25

38.66

73.44

60.54

0.01

8000-10000

65.00

30.04

64.10

65.79

76.25

58.52

80.55

68.21

0.01

6000-8000

76.67

54.25

78.63

72.54

89.58

80.94

90.57

82.76

0.01

5000-8000

81.67

63.65

82.54

78.79

89.58

80.94

90.57

82.76

0.01

4000-6000

84.17

68.57

84.80

81.54

93.33

87.45

93.75

88.24

0.01

AC

CE

8000-9000

21

ACCEPTED MANUSCRIPT

4.2. Comparisons Results between RHSVM and RCSSVM in near-infrared spectroscopy data In this section, the proposed methods are used directly in a practical appli-

CR IP T

cation problem, to recognize the hardness of licorice seeds using near-infrared (NIR) spectroscopy data ([26]). Maize is the main grain crop in China, and

its yield was significantly related to the seed purity. The ”NongDa108” maize

hybrid seeds and ”mother178” seeds used in the experiments were harvested in Beijing, China, in 2008. A total of 240 seeds samples were selected in this exper-

iment,120 from hybrid seeds and 120 from mother seeds. After initial spectra

AN US

were digitized, the spectral data is measured at 1555 wavelength points in the

range of 4000 − 10, 000cm−1 . What’s more, all the datasets in this experiment share the same imbalance ratio. The positive to the negative ratio is 1:2. In this test, let k =

1 a,

where a = −1 − η1 ln(1 −

C−1 C1

+

C−1 C1 exp(−2ηC1 )).

We adjust C1 according to k by setting C1 = ea , therefore, we can regulate the distance to the negative hyperplane k. The results in Table 3 and Table 4 show

M

that the RCSSVM model perform far better than RHSVM by almost all of the evaluation criterions that are used in this text. The rightmost column gives the

ED

best selection of the parameter k. Other parameters are selected as section 4.1. From the result, we can see that the proposed robust loss function has obvious

PT

superiority. Thus, the validity of the model is further verified. 4.3. Comparisons Results between CSSVM and RCSSVM with noise

CE

For the robustness of the RHSVM model, a large amount of experiments are performed in [24]. But, the RHSVM model may not perform well when dealing with the imbalanced issue. In this subsection, we verified that the RCSSVM

AC

model is robust under cost sensitive conditions. In this test, two type of noise are added into the data sets which are label noise and feature noise. For the label noise, we randomly reverse 10% of the positive points to the negative class and reverse 10% of the negative points to the positive class in the datasets. For the feature noise, we randomly choose 10% points (10% out of positive class and 10% out of negative class) to add white Gaussian noise (the noise obeys 22

ACCEPTED MANUSCRIPT

a normal distribution whose mean is 0 and variance is 1). Table 5 and Table 6 show the average results of 5 times independent tests. The parameters of the RCSSVM model are selected as section 4.1. What’s more, we represent the

CR IP T

optimal parameters in Table 7. AS represented in Table 5 and Table 6, the performances of the propose

RCSSVM are better than CSSVM and RHSVM. In the CSSVM model, we fix kk =

1 2C−1 −1

= 0.5 and tune the parameter C1 from 0.8 to 2.5 every 0.1 steps.

In the RHSVM and RCSSVM models, we fix C−1 = 1 and tune the parameter C1 from 0.8 to 2.5 every 0.1 steps and tune the parameter η from 0.2 to 0.5. We

AN US

list the RHSVM( η = 0.2 and η = 0.5 ) and RCSSVM( η = 0.2 and η = 0.5 ). When comparing the performance among different models, we should aggregate the performance under different parameters and take the highest accuracy for fair. Thus, for the performance of RCSSVM, the number of times with the highest precision will be accumulated. Therefore, the performance of RCSSVM is better than the rest two models.the superiority of our model is outstanding

M

when dealing with the noisy imbalanced datasets.

To further illustrate the robustness of our proposed mode, we performed

ED

numerical experiments with 5 different standard deviation Gaussian noises with σ =0, 0.01, 0.1, 1, 5. The results are shown in Figure 2. From the results, we can see that with the standard deviation of the noise increases, ACC of all test

PT

models decrease. Our model performs superior results on CMC, Ionosp, QSAR and wholesale than other models, performs poor on heapt and equivalent on

AC

CE

german.

23

ACCEPTED MANUSCRIPT

RHSVM

RHSVM

RCSSVM

RCSSVM

kk=0.5

η=0.2

η=0.5

η=0.2

η=0.5

CMC

63.98

63.89

63.10

64.25

64.25

german

76.90

75.70

76.20

76.70

76.20

hepat

85.33

82.00

80.00

83.33

84.00

Ionosp

87.35

85.59

86.47

87.94

87.65

QSAR

86.63

86.92

87.12

87.40

87.21

wholesale

91.40

90.70

90.93

91.86

92.09

CMC

29.55

27.79

26.20

30.01

31.09

german

44.12

35.94

37.87

41.74

38.08

hepat

50.92

43.64

12.51

38.99

40.31

Ionosp

69.65

65.44

67.16

70.38

69.60

QSAR

69.50

70.09

70.31

71.93

70.67

wholesale

81.07

78.73

80.03

82.45

82.48

CMC

66.87

63.91

63.11

67.02

68.41

66.24

53.41

56.11

62.69

56.63

hepat

66.67

62.50

12.31

51.76

52.07

Ionosp

81.55

78.08

78.71

80.90

79.95

QSAR

83.32

83.62

83.64

85.05

83.93

wholesale

90.22

88.75

89.66

91.02

90.95

CMC

62.90

63.89

63.09

63.17

62.68

german

79.71

81.75

81.90

80.51

81.69

hepat

88.89

83.33

80.00

88.00

89.80

Ionosp

94.60

94.22

96.07

97.44

98.02

QSAR

89.79

90.12

90.54

89.71

90.46

wholesale

92.72

92.84

92.33

92.83

93.45

CE AC

R

M

german

PT

F1

ED

MCC

AN US

CSSVM

Datasets

ACC

CR IP T

Table 5: experimental results with label noise

24

ACCEPTED MANUSCRIPT

RHSVM

RHSVM

RCSSVM

RCSSVM

kk=0.5

η=0.2

η=0.5

η=0.2

η=0.5

CMC

64.34

63.27

63.45

64.16

64.96

german

77.20

76.30

76.00

77.10

76.50

hepat

84.00

82.67

83.33

84.67

84.00

Ionosp

87.35

85.88

87.35

87.94

87.35

QSAR

86.44

86.83

86.35

87.40

87.69

wholesale

91.16

90.47

91.16

91.40

91.63

CMC

31.31

26.13

27.10

29.12

35.12

german

45.57

37.96

38.24

44.07

43.45

hepat

54.34

32.67

45.83

53.44

40.31

Ionosp

69.01

66.75

69.01

71.86

69.23

QSAR

69.08

69.43

68.72

74.40

74.06

wholesale

80.71

77.77

80.07

81.40

81.43

CMC

68.53

62.20

63.89

65.79

71.32

67.82

55.89

57.80

65.73

66.02

hepat

72.73

41.03

63.16

70.94

52.07

Ionosp

79.73

80.15

79.73

83.98

80.48

QSAR

83.07

82.99

82.78

87.12

86.71

wholesale

90.06

88.07

89.57

90.47

90.38

CMC

62.74

63.67

63.29

63.52

62.44

german

79.73

82.24

80.77

80.23

79.07

hepat

85.39

88.89

85.71

86.75

89.80

Ionosp

97.38

92.30

97.38

92.84

96.21

QSAR

89.60

90.57

89.68

87.67

88.65

wholesale

92.39

93.09

92.95

92.44

93.06

CE AC

R

M

german

PT

F1

ED

MCC

AN US

CSSVM

Datasets

ACC

CR IP T

Table 6: experimental results on different level of feature noise

25

26

feature noise

label noise

noise free

(C, C1 ) = (10, 1.3) (C, C1 ) = (1000, 1) (C, C1 ) = (0.1, 1.7) (C, C1 ) = (10, 1.2) (C, C1 ) = (1000, 1.3) (C, C1 ) = (1, 2.2) (C, C1 ) = (100, 0.9) (C, C1 ) = (1000, 0.9) (C, C1 ) = (0.1, 1.8)

QSAR

wholesale

CMC

german

hepat

Ionosp

QSAR

wholesale

(C, C1 ) = (0.01, 1.1)

CMC

Ionosp

(C, C1 ) = (0.01, 2.5)

wholesale

(C, C1 ) = (10, 1.7)

(C, C1 ) = (10, 0.9)

QSAR

hepat

(C, C1 ) = (10, 0.8)

Ionosp

(C, C1 ) = (0.01, 1.5)

(C, C1 ) = (0.1, 1.3)

hepat

german

(C, C1 ) = (1, 1.1)

ED

(C, C1 ) = (0.01, 1.1)

M

C = 0.1

C = 10

C = 100

C = 0.1

C = 1000

C = 0.1

C = 0.1

C=1

(C, C1 ) = (0.001, 1.6)

(C, C1 ) = (10, 0.8)

(C, C1 ) = (10, 0.8)

(C, C1 ) = (0.01, 1.7)

(C, C1 ) = (0.1, 1.3)

(C, C1 ) = (0.01, 1)

η=0.2

RCSSVM

C = 0.01

C = 10

C=1

C = 100

C = 100

C = 0.1

C = 100

C = 10

C = 0.1

C = 0.01

C = 100

C = 100

(C, C1 ) = (0.01, 1.5)

(C, C1 ) = (1, 1)

(C, C1 ) = (1, 1.2)

(C, C1 ) = (0.01, 1.5)

(C, C1 ) = (0.1, 1.1)

(C, C1 ) = (100, 1.1)

(C, C1 ) = (0.001, 0.9)

(C, C1 ) = (1000, 1.1)

(C, C1 ) = (100, 0.9)

(C, C1 ) = (0.01, 1.4)

(C, C1 ) = (0.1, 1.1)

(C, C1 ) = (100, 1.2)

η=0.5

RCSSVM

(C, C1 ) = (0.01, 1.5)

(C, C1 ) = (1000, 1.7)

(C, C1 ) = (1000, 1.8)

(C, C1 ) = (1, 1.6)

(C, C1 ) = (1, 1.2)

(C, C1 ) = (0.01, 1)

(C, C1 ) = (0.001, 0.9)

(C, C1 ) = (10, 1.5)

(C, C1 ) = (1, 1.1)

(C, C1 ) = (0.01, 1.5)

(C, C1 ) = (100, 1.4)

(C, C1 ) = (0.1, 1.3)

CR IP T

(C, C1 ) = (0.001, 2.3)

(C, C1 ) = (1, 1.3)

(C, C1 ) = (0.1, 1.2)

(C, C1 ) = (0.01, 1.8)

(C, C1 ) = (0.1, 1.2)

(C, C1 ) = (0.1, 1.1)

AN US

C = 1000

C = 100

C = 100

C = 0.01

C = 0.1

C = 0.001

C = 10

C = 1000

C = 10

C=1

C=1

C = 0.1

η=0.5

RHSVM

C = 10

C = 0.01

C = 0.1

C = 0.01

η=0.2

RHSVM

Table 7: optimal parameter in the experiments

german

CMC

kk=0.5

CSSVM

PT

CE

Datasets

AC

ACCEPTED MANUSCRIPT

ACCEPTED MANUSCRIPT

65

78

64

77

63

76

75 CSSVM RHSVM1 RHSVM2 RCSSVM1 RCSSVM2

60

73

59

72

58

57

74

71

0

0.01

0.1 Standard deviation of noise σ

1

70

5

0

0.01

(a)

84

87.5

86.5

82

86 81 85.5 ACC

ACC

80 85

79 84.5 78

CSSVM RHSVM1 RHSVM2 RCSSVM1 RCSSVM2

76

75

0

0.01

0.1 Standard deviation of noise σ

5

(b)

83

1

5

82.5

0

0.01

(c)

0.1 Standard deviation of noise σ

1

5

(d)

95

CSSVM RHSVM1 RHSVM2 RCSSVM1 RCSSVM2

87

86.5

90

86

85

ACC

85.5 ACC

1

83.5

87.5

85

84.5

80

83.5

83

0.01

0.1 Standard deviation of noise σ

1

5

(e)

CSSVM RHSVM1 RHSVM2 RCSSVM1 RCSSVM2

75

70

0

0.01

0.1 Standard deviation of noise σ

1

5

(f)

ED

0

M

84

82.5

AN US

84 77

0.1 Standard deviation of noise σ

CSSVM RHSVM1 RHSVM2 RCSSVM1 RCSSVM2

87

83

CR IP T

61

ACC

ACC

62

CSSVM RHSVM1 RHSVM2 RCSSVM1 RCSSVM2

Figure 3: classifier performances of ACC with the change of σ measured by CSSVM, RHSVM1(η = 0.2) and RHSVM2 (η = 0.5), RCSSVM1 (η = 0.2) and RCSSVM2 (η = 0.5) model on different data sets: (a) CMC; (b) german; (c) heapt; (d) Ionosp; (e) QSAR; (f)

PT

wholesale

CE

5. Conclusion

In the binary classification problem, cost sensitive classification system as

a Bayes consistent classifier is usually an effective way to handle imbalanced

AC

problems. However, when the train sets exist noise, the proposed model demonstrated its advantages. The main contributions can be summarized as follows: (1) According to Bayes Theory, a new non-convex robust loss function which

is Fisher consistent is designed to deal with the imbalanced classification problem when there exists noise.

27

ACCEPTED MANUSCRIPT

(2) By applying this new loss function in SVM framework, a non-convex robust classifier is derived which is called robust cost sensitive support vector machine (RCSSVM). To obtain the optimal solution and reduce the algorithm

CR IP T

complexity, corresponding alternative iterative algorithm is designed. What is worth mentioning is that for label and feature noise in the training data sets, the weight of this kind of points is under control in the proposed loss function

and the model is equivalent to a re-weighted SVM. This causes the proposed model insensitive to noise in theory.

(3) In the label noise environment and feature noise environment, comparing

AN US

numerical experiments of the CSSVM, RHSVM and RCSSVM are performed on UCI datasets as well as 1 application dataset respectively. To further test the robustness of the model, three different intensity noise levels are considered. The results show that the proposed RCSSVM improved the accuracy of minority class while having robustness and thus, the robustness of the proposed model is

ED

Acknowledgements

M

verified.

This work is supported by National Nature Science Foundation of China

PT

(11471010,11271367) and Chinese Universities Scientific Fund.

CE

References

AC

[1] Akbani, R., Kwek, S., Japkowicz, N., 2004. Applying Support Vector Machines to Imbalanced Datasets. Springer Berlin Heidelberg.

[2] Blake, C., 1998. Uci repository of machine learning databases . [3] Duan, W., Jing, L., Lu, X.Y., 2014. Imbalanced data classification using cost-sensitive support vector machine based on information entropy. Advanced Materials Research 989-994, 1756–1761. 28

ACCEPTED MANUSCRIPT

[4] Ezawa, K.J., Singh, M., Norton, S.W., 1996.

Learning goal oriented

bayesian networks for telecommunications risk management, in: Thirteenth International Conference on International Conference on Machine Learn-

CR IP T

ing, pp. 139–147. [5] Fawcett, T., 2006. An introduction to roc analysis. Pattern Recognition Letters 27, 861–874.

[6] He, R., Zheng, W.S., Tan, T., Sun, Z., 2014. Half-quadratic based itera-

tive minimization for robust sparse representation. IEEE Transactions on

AN US

Pattern Analysis & Machine Intelligence 36, 261.

[7] Hong, C., Yu, J., Tao, D., Wang, M., 2015a. Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Transactions on Industrial Electronics 62, 3742–3751.

[8] Hong, C., Yu, J., You, J., Chen, X., 2015b. Hypergraph regularized au-

M

toencoder for 3d human pose recovery , 132–140.

[9] Japkowicz, N., Stephen, S., 2002. The class imbalance problem: A system-

ED

atic study. Intelligent Data Analysis 6, 429–449. [10] Karakoulas, G., Shawe-Taylor, J., 1999. Optimizing classifiers for imbal-

PT

anced training sets, in: Conference on Advances in Neural Information Processing Systems II, pp. 253–259.

CE

[11] Katsumata, S., Takeda, A., 2015. Robust Cost Sensitive Support Vector Machine , 434–443.

AC

[12] Kim, J., Choi, K., Kim, G., Suh, Y., 2012. Classification cost: An empirical comparison among traditional classifier, cost-sensitive classifier, and metacost. Expert Systems with Applications 39, 4013–4019.

[13] Kubat, M., Matwin, S., 1997. Addressing the curse of imbalanced training sets: One-sided selection, in: International Conference on Machine Learning, pp. 179–186. 29

ACCEPTED MANUSCRIPT

[14] Lin, Y., 2004. A note on margin-based loss functions in classification. Statistics and Probability Letters 68, 73–82. [15] Liu, W., Pokharel, P.P., Principe, J.C., 2007. Correntropy: Properties

Signal Processing 55, 5286–5298.

CR IP T

and applications in non-gaussian signal processing. IEEE Transactions on

[16] Masnadi-Shirazi, H., Vasconcelos, N., 2008. On the design of loss functions for classification: theory, robustness to outliers, and savageboost, in:

Conference on Neural Information Processing Systems, Vancouver, British

AN US

Columbia, Canada, December, pp. 1049–1056.

[17] Masnadishirazi, H., Vasconcelos, N., Iranmehr, A., 2012. Cost-sensitive support vector machines .

[18] Menglong Yang, Y.L., You, Z., 2017. The euclidean embedding learning

ing 267, 195–200.

M

based on convolutional neural network for stereo matching. Neurocomput-

[19] Pant, R., Trafalis, T.B., Barker, K., 2011. Support vector machine clas-

ED

sification of uncertain and imbalanced data using robust optimization, in: Wseas International Conference on Computers, pp. 369–374.

PT

[20] Qian, S., Liu, H., Liu, C., Wu, S., Wong, H.S., 2018. Adaptive activation functions in convolutional neural networks. Neurocomputing 272, 204–212.

CE

[21] Raskutti, B., Kowalczyk, A., 2004. Extreme re-balancing for SVMs: a case study. ACM.

AC

[22] Riccardi, A., Fern´ andeznavarro, F., Carloni, S., 2014. Cost-sensitive adaboost algorithm for ordinal regression based on extreme learning machine. IEEE Transactions on Cybernetics 44, 1898.

[23] Tao, D., Hong, C., Yu, J., Wan, J., Wang, M., 2015. Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society 24, 5659–5670. 30

ACCEPTED MANUSCRIPT

[24] Xu, G., Cao, Z., Hu, B.G., Principe, J.C., 2017. Robust support vector machines based on the rescaled hinge loss function. Pattern Recognition 63, 139–148.

anced data :connecting cluster to classification .

CR IP T

[25] Yan, Q., Xia, S., Meng, F., 2017. Optimizing cost-sensitive svm for imbal-

[26] Yang, L., Sun, Q., 2012. Recognition of the hardness of licorice seeds using a semi-supervised learning method. Chemometrics and Intelligent Laboratory Systems 114, 109–115.

AN US

[27] Young, C.M., Liu, C.C., Liu, C.H., 1996. New inverter-driven design and control method for two-phase induction motor drives. Electric Power Applications, IEE Proceedings 143, 458–466.

[28] Yu, J., Kuang, Z., Zhang, B., Zhang, W., Lin, D., Fan, J., 2018. Leveraging content sensitiveness and user trustworthiness to recommend fine-grained

M

privacy settings for social image sharing. IEEE Transactions on Information Forensics & Security 13, 1317–1332.

ED

[29] Yu, J., Zhang, B., Kuang, Z., Lin, D., Fan, J., 2017. iprivacy: Image privacy protection by identifying sensitive objects via deep multi-task learning.

PT

IEEE Transactions on Information Forensics & Security 12, 1005–1016. [30] Zadrozny, B., Langford, J., Abe, N., 2003. Cost-sensitive learning by costproportionate example weighting, in: IEEE International Conference on

CE

Data Mining, pp. 435–442.

AC

[31] Zhang, J., Yu, J., Tao, D., 2018. Local deep-feature alignment for unsupervised dimension reduction. IEEE Transactions on Image Processing 27, 1–14.

[32] Zhang, L., Zhang, D., 2015. Evolutionary cost-sensitive extreme learning machine. IEEE Transactions on Neural Networks and Learning Systems 28, 3045–3060.

31

ACCEPTED MANUSCRIPT

[33] Zhang, T., 2004. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32, 56–84. [34] Zheng, E., Zhang, C., Liu, X., Lu, H., Sun, J., 2013. Cost-sensitive extreme

CR IP T

learning machine, in: International Conference on Advanced Data Mining and Applications, pp. 478–488.

[35] Zhu, C., Wang, Z., 2017. Entropy-based matrix learning machine for imbalanced data sets. Pattern Recognition Letters 88, 72–80.

[36] Zong, W., Huang, G.B., Chen, Y., 2013. Weighted extreme learning ma-

AN US

chine for imbalance learning. Neurocomputing 101, 229–242.

Liming Yang is a Professor at College of Science, China Agricultural University. Currently her research interests include optimization theory and method, machine learning

M

and data mining.

ED

Wang received master degree in mathematics and information science from Hebei University from 2013 to 2016. She is currently a Ph.D. student in operations and management in

PT

China Agricultural University, China. Her research interests include neural networks, pattern recognition, and machine

AC

CE

learning.

32