

Neurocomputing 000 (2018) 1–13

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Ramp loss one-class support vector machine; A robust and effective approach to anomaly detection problems

Yingjie Tian a,b, Mahboubeh Mirzabagheri a,b,c, Seyed Mojtaba Hosseini Bamakan b,c,d,∗, Huadong Wang a,b, Qiang Qu d

a Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
b Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
c School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
d Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

Article info

Article history: Received 28 October 2017; Revised 12 February 2018; Accepted 7 May 2018; Available online xxx. Communicated by Dr Zhao Zhang.

Keywords: Anomaly detection; One-class SVM; Ramp loss function; Non-convex problem

Abstract

Anomaly detection is defined as the problem of finding those data samples that do not follow the patterns of the majority of data points. Among the variety of methods and algorithms proposed to deal with this problem, boundary-based methods, including the One-class support vector machine (OC-SVM), are considered effective and outstanding. Nevertheless, extreme sensitivity to the presence of outliers and noise in the training set is an important drawback of this group of classifiers. In this paper, we address this problem by developing a robust and sparse methodology for anomaly detection that introduces the Ramp loss function into the original One-class SVM, called "Ramp-OCSVM". The main objective of this research is to take advantage of the non-convexity of the Ramp loss function to build a robust and sparse semi-supervised algorithm. Furthermore, the Concave–Convex Procedure (CCCP) is utilized to solve the resulting model, which is a non-differentiable non-convex optimization problem. We perform comprehensive experiments and a parameter sensitivity analysis on two artificial data sets and selected data sets from the UCI repository to show the superiority of our model in terms of detection power and sparsity. Moreover, evaluations are carried out on the NSL-KDD and UNSW-NB15 data sets, a well-known and a recently published intrusion detection data set, respectively. The obtained results show that our model is more robust to outliers and superior in the detection of anomalies. © 2018 Elsevier B.V. All rights reserved.

∗ Corresponding author at: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China. E-mail addresses: [email protected] (Y. Tian), [email protected] (M. Mirzabagheri), [email protected], [email protected] (S.M.H. Bamakan), [email protected] (H. Wang), [email protected] (Q. Qu).

https://doi.org/10.1016/j.neucom.2018.05.027 0925-2312/© 2018 Elsevier B.V. All rights reserved.

Please cite this article as: Y. Tian et al., Ramp loss one-class support vector machine; A robust and effective approach to anomaly detection problems, Neurocomputing (2018), https://doi.org/10.1016/j.neucom.2018.05.027

1. Introduction

The history of anomaly detection, or outlier detection, can be traced back to studies carried out by the statistics community at the beginning of the nineteenth century [13]. Because of its lasting importance, researchers from various domains have addressed this problem, and a broad range of techniques, from generic to specific methods, has been proposed [6]. In [6], the authors defined anomaly detection as the problem of finding those data samples that do not follow the patterns of the majority of data points. In other words, anomaly detection is the problem of distinguishing normal data points, which have well-defined patterns or signatures, from those that do not conform to the expected profiles. Anomaly detection has a wide range of applications, such as fraud detection [36,48], healthcare monitoring [3], fault detection [11,55], event detection [36] and intrusion detection [25,27,32,35,39,46].

In some cases, anomalies and outliers are considered the same concept and the terms are used interchangeably. In our research we distinguish them: we want to reduce the impact of outliers and noise in the training set on the proposed method in order to detect the abnormal classes better.

The focus of our research is on anomalies that occur in computer networks. The main challenges in this field include the massive volume of network traffic, the streaming nature of traffic data, a high false alarm rate and the lack of labeled data for attacks. Among these challenges, the availability of labeled data is a significant factor that determines whether the chosen technique is supervised, semi-supervised or unsupervised. Since preparing labels for attack classes is costly, semi-supervised and unsupervised techniques are more favorable for anomaly detection problems [6,53]. In fact, data from the normal class are widely available, so the problem can be formulated as a one-class classification problem.

As defined by Giacinto et al. [17], one-class classification refers to those two-class problems in which the main class is well sampled, or in other words has a well-defined signature, whereas the other class is severely undersampled because of its extremely diverse nature and the difficulty of obtaining a significant number of clear and well-defined training patterns. The main objective of a one-class classification technique is to distinguish a set of target objects from all other objects, which are defined as anomalies or outliers [17]. In this case the machine learning technique is trained on one class only, and the task is to determine whether a new test sample is a member of this class or not. This situation arises in many cases, such as fault detection in an industrial process or anomaly detection in network traffic analysis.

Many researchers have addressed this problem, and the wide range of proposed techniques can be categorized into three groups: density methods such as the Parzen window estimator [12], reconstruction methods such as k-means [24], and boundary methods such as One-class SVM [41] and Support Vector Data Description (SVDD) [44]. Among boundary-based classification methods, the basic idea of SVM in a two-class problem is to establish a hyperplane that separates the two classes of objects with a maximum margin [19,51,52,54]. In a one-class problem, however, the separating hyperplane is constructed so that it has the maximum margin between the normal data points and the origin [41]. A new data sample is classified as normal if it is located within the boundary; conversely, it is detected as an abnormality when it lies outside the boundary [54].
The focus of our research is on the one-class SVM technique, a brief review of which is presented in Section 2. Although One-class SVM (OC-SVM) is considered an effective and outstanding classification technique for anomaly detection problems, it is sensitive to the presence of outliers and noise in the training set. Here, outliers refer to those data points that deviate from the majority of the others. Real-world data sets mostly contain outliers for reasons such as instrument failure, formatting errors and non-representative sampling. The sensitivity of OC-SVM to outliers comes from the convexity of the Hinge loss function: the farther an outlier lies from the decision boundary, the larger its loss. Therefore, the outliers shift the decision boundary toward themselves, which decreases the generalization power of OC-SVM.

The main contributions of this paper are as follows:

(a) Ramp loss One-class SVM is developed as a robust and sparse anomaly detection methodology.
(b) Since Ramp-OCSVM is a non-differentiable non-convex optimization problem, the Concave–Convex Procedure (CCCP) is introduced to solve it.
(c) The efficiency and robustness of the proposed method are examined on different data sets, including artificial data, some UCI benchmark data sets, and two network anomaly detection data sets.

2. Related works

Since the introduction of the first One-class SVM model by Schölkopf et al. [41] in 2001, many researchers have attempted to improve the basic model [14,26,31]. Guillermo et al. [18] proposed a modified version of OC-SVM to deal with the abrupt change detection problem. The proposed method is based on finding the region of the input space where most of the data points are located. Since this region changes with the time window, the model is named "One-class Time-Adaptive SVM". A weighted version of OC-SVM, named "WOC-SVM", was proposed by Bicego and Figueiredo [3]. In their research, a weight factor that represents the importance of the corresponding data point is used to train the model, and WOC-SVM is used to build a soft clustering algorithm. In [57], the authors further improved WOC-SVM by introducing a novel instance-weighting strategy in which k-nearest neighbors are used to assign weights to data samples. Higher weights are assigned to samples close to the boundary of the data distribution and, conversely, lower weights to samples located in the interior of the training set.

In particular, some researchers have focused on the sensitivity of OC-SVM to outliers and proposed modifications to make this classifier more robust. Amer et al. [1] in 2013 proposed two enhanced versions of OC-SVM, called "robust one-class SVM" and "eta one-class SVM", to make it more effective for unsupervised anomaly detection. In the former model, the slack variables are modified so that they are proportional to the distance to the centroid in the kernel space. In eta OC-SVM, on the other hand, the number of non-zero slack variables that contribute to the objective function is controlled by introducing a 0–1 variable ηi. Here, ηi equal to 1 refers to a normal data sample, and ηi equal to 0 refers to an outlier [1]. Since the objective function of this problem consists of a convex quadratic problem and a linear problem, the objective is not jointly convex. Thus, it needs to be relaxed into a semi-definite programming problem.
According to Amer et al. [1], eta OC-SVM shows promising results compared to the other methods in terms of sparsity and area under the curve (AUC). In 2014, Yin et al. [54] also addressed the sensitivity of OC-SVM to noise and outliers in the training set, in the context of the fault detection problem. To reduce the influence of these points on the decision function of OC-SVM, an adaptive penalty factor was introduced to develop a robust one-class SVM. In OC-SVM, a slack variable ξi allows some data points to lie outside the boundary, and the number of these points is controlled by the penalty factor 1/(νl): the smaller the penalty factor, the more likely data points are to lie outside the decision boundary. In the proposed robust OC-SVM, however, the penalty factor is adjusted by taking into account the distances between the data samples and the center of the data set [54].

Motivated by the aforementioned works and by our previous research, in which we developed a precise, sparse and robust methodology for the multi-class intrusion detection problem based on the Ramp Loss K-Support Vector Classification-Regression, named "Ramp-KSVCR" [20], we address the sensitivity of One-class SVM to outliers by introducing a non-convex loss that reduces the impact of unexpected points in the data set. During the reviewing process of this paper, we found that a similar idea was used by Xiao et al. [50]. Although both works develop a Ramp loss based OC-SVM, the approaches to formulating and treating the problem are different. In addition to developing the proposed model based on sequential minimal optimization (SMO) [7], and in order to make our model more applicable in the large-scale setting, the Alternating Direction Method of Multipliers (ADMM) is used to solve the quadratic programming sub-problems in each iteration of CCCP [4,20,47].
Moreover, the main focus of this paper is the problem of computer network anomaly detection. Hence, besides carrying out comprehensive experiments and a parameter sensitivity analysis on two artificial data sets and some general anomaly detection data sets to show the superiority of our model in terms of detection power and sparsity, we further examine and compare our method on some intrusion detection data sets, as discussed in Section 6.

3. Background

3.1. One-class support vector machine

Suppose a training set is defined as $x_i \in R^n$, $i = 1, \ldots, l$, where $x_i = (x_{i1}, \ldots, x_{in})$ is a normal data sample in the n-dimensional real space $R^n$, $y_i$ is the corresponding output of $x_i$, and $l$ is the number of samples. In the rest of the paper, boldface Greek letters denote vectors whose components are labeled using a normal typeface. Since in some cases a separating hyperplane cannot be found in the original real space, a mapping function $\phi(x)$ is defined to map the data samples into an inner product space $F$ in such a way that $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. As discussed by Schölkopf and Platt [41], the Gaussian kernel in Eq. (1.1) is a kernel that guarantees the existence of possible solutions.

Gaussian kernel:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \exp\left(-\gamma \|x_i - x_j\|_2^2\right) \tag{1.1}$$

By considering the above definition, the problem of finding the separating hyperplane is formulated as the following quadratic program:

$$\min_{w,\xi,\rho} \ \frac{1}{2} w^T w - \rho + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i \qquad \text{s.t.} \quad w^T \phi(x_i) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l \tag{1.2}$$

where $l$ is the number of data points, $\xi = (\xi_1, \xi_2, \ldots, \xi_l)$ are the slack variables that are penalized in the objective function, and $w$ is a weight vector of the same dimension as the feature space $\phi(x_i)$. By introducing slack variables we allow some $x_i$ to lie outside the decision boundary during the training phase. $\nu \in (0, 1]$ is the regularization term that controls the tradeoff between maximizing the distance from the origin and containing most of the data in the region created by the hyperplane. The value of $\nu$ corresponds to the ratio of outliers in the training set. $w$ and $\rho$ are the parameters that determine the decision boundary, and they are the target variables of the optimization problem. The decision boundary can be formulated as:

$$f(x) = \operatorname{sgn}\left(w^T \phi(x) - \rho\right) \tag{1.3}$$

where $\phi(x)$ is a mapping function which, in the nonlinearly separable case, maps the data samples from the original input space into a higher dimensional feature space by applying a kernel function $K(x_i, x_j)$. The mapping makes the training samples linearly separable: the inner product of $x_i$ and $x_j$ in the input space is replaced by $\phi(x_i)^T \phi(x_j)$ in the feature space, i.e. $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. The geometric illustration of One-class SVM is presented in Fig. 1. As Fig. 1 shows, the value of $f(x)$ for a new data point $x$ determines whether this point is classified as normal or as an anomaly, depending on which side of the hyperplane it falls on in feature space. The distance from any point $x_i$ to the hyperplane $w^T \phi(x) - \rho = 0$ is $d(x_i) = |w^T \phi(x_i) - \rho| / \|w\|$. In the case of One-class SVM, taking the origin as $x_0$, this distance is $\rho / \|w\|$. The objective of maximizing this margin can therefore be rewritten in minimization form as $\frac{1}{2}\|w\|^2 - \rho$. Besides maximizing the margin, the objective function also minimizes the average of the slack variables [1].

Fig. 1. The geometry of one-class SVM hyperplane in feature space.

To find the solution of the optimization problem (1.2), we need to derive its dual problem. By introducing Lagrange multipliers $\alpha_i, \beta_i \ge 0$, its Lagrangian is given by:

$$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|_2^2 + \frac{1}{l\nu} \sum_{i=1}^{l} \xi_i - \rho - \sum_{i=1}^{l} \alpha_i \left(w^T \phi(x_i) - \rho + \xi_i\right) - \sum_{i=1}^{l} \beta_i \xi_i \tag{1.4}$$

Setting the partial derivatives of the Lagrangian with respect to $w$, $\xi$ and $\rho$ to zero gives:

$$w = \sum_{i=1}^{l} \alpha_i \phi(x_i) \tag{1.5}$$

$$\alpha_i = \frac{1}{l\nu} - \beta_i, \quad i = 1, 2, \ldots, l \tag{1.6}$$

$$\sum_{i=1}^{l} \alpha_i = 1 \tag{1.7}$$
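The step from the Lagrangian to the dual can be written out as a short worked derivation. It is a sketch that uses only Eqs. (1.4)–(1.7) and the notation $Q_{ij} = \phi(x_i)^T \phi(x_j)$; the regrouping of terms in the first line is ours and is not spelled out in the source.

```latex
\begin{aligned}
L &= \tfrac{1}{2}\|w\|_2^2 - \sum_{i=1}^{l} \alpha_i\, w^T \phi(x_i)
     + \rho \Big( \sum_{i=1}^{l} \alpha_i - 1 \Big)
     + \sum_{i=1}^{l} \xi_i \Big( \tfrac{1}{l\nu} - \alpha_i - \beta_i \Big) \\
  &= \tfrac{1}{2}\, \alpha^T Q \alpha - \alpha^T Q \alpha + 0 + 0
     \qquad \text{by (1.5), (1.7) and (1.6)} \\
  &= -\tfrac{1}{2}\, \alpha^T Q \alpha .
\end{aligned}
```

Maximizing $-\tfrac{1}{2}\alpha^T Q \alpha$ is the same as minimizing $\tfrac{1}{2}\alpha^T Q \alpha$, and $\beta_i \ge 0$ together with (1.6) bounds each multiplier by $0 \le \alpha_i \le 1/(l\nu)$.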

Based on the Karush–Kuhn–Tucker (KKT) conditions, substituting Eqs. (1.5)–(1.7) into Eq. (1.4) and writing the kernel matrix as $Q_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, the dual problem is formulated as follows:

$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha \qquad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{\nu l}, \quad i = 1, \ldots, l, \qquad e^T \alpha = 1 \tag{1.8}$$

where $e$ is a vector of ones. According to Eq. (1.5), all data samples $\{x_i : \alpha_i > 0\}$ are support vectors (SVs). Based on the KKT conditions, $\alpha_i = 0$ corresponds to a data point within the boundary, $0 < \alpha_i < 1/(l\nu)$ to a data point on the decision boundary, and $\alpha_i = 1/(l\nu)$ to a point outside the boundary, a so-called outlier. By applying the kernel function, the decision function in non-linear form becomes Eq. (1.9):

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i K(x_i, x) - \rho \right) \tag{1.9}$$

By solving Eq. (1.8), the solution $\alpha$ is obtained, and $\rho$ is computed from Eq. (1.10):

$$\rho = \sum_{j=1}^{l} \alpha_j K(x_j, x_i) \quad \text{for any } i \text{ with } \alpha_i \in \left(0, \frac{1}{l\nu}\right) \tag{1.10}$$
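Eqs. (1.9) and (1.10) can be illustrated with a minimal Python sketch. It assumes the dual solution is already available; the function names (`gaussian_kernel`, `kernel_expansion`, `recover_rho`, `decide`) are hypothetical, and the two-point data set with $\alpha = (0.5, 0.5)$ is a hand-made toy case (by symmetry it minimizes (1.8) for $\nu = 0.5$), not data from the paper.

```python
import math

def gaussian_kernel(x, y, gamma):
    """Gaussian kernel of Eq. (1.1): exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_expansion(x, alphas, points, gamma):
    """sum_i alpha_i * K(x_i, x), the kernel part of Eq. (1.9)."""
    return sum(a * gaussian_kernel(p, x, gamma) for a, p in zip(alphas, points))

def recover_rho(alphas, points, nu, gamma, tol=1e-12):
    """Eq. (1.10): evaluate the expansion at a point with 0 < alpha_i < 1/(l*nu)."""
    upper = 1.0 / (nu * len(points))
    for a_i, x_i in zip(alphas, points):
        if tol < a_i < upper - tol:
            return kernel_expansion(x_i, alphas, points, gamma)
    raise ValueError("no support vector strictly inside (0, 1/(l*nu))")

def decide(x, alphas, points, rho, gamma):
    """Eq. (1.9): +1 for normal, -1 for anomaly (a score of 0 counts as normal)."""
    return 1 if kernel_expansion(x, alphas, points, gamma) - rho >= 0 else -1

# Toy usage (assumed values, for illustration only).
points = [(0.0, 0.0), (1.0, 0.0)]
alphas = [0.5, 0.5]
nu, gamma = 0.5, 1.0
rho = recover_rho(alphas, points, nu, gamma)
label_near = decide((0.5, 0.0), alphas, points, rho, gamma)  # between the two points
label_far = decide((5.0, 0.0), alphas, points, rho, gamma)   # far from both points
```

In this toy case the point between the two training samples is labeled normal and the distant point is labeled an anomaly.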


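By replacing $\alpha$ with $\frac{1}{\nu l}\bar{\alpha}$ in Eq. (1.8), we obtain a scaled version of (1.8):

$$\min_{\bar{\alpha}} \ \frac{1}{2} \bar{\alpha}^T Q \bar{\alpha} \qquad \text{s.t.} \quad 0 \le \bar{\alpha}_i \le 1, \quad i = 1, \ldots, l, \qquad e^T \bar{\alpha} = \nu l \tag{1.11}$$

If $\alpha$ is optimal for the dual problem (1.8) and $\rho$ is optimal for the primal problem (1.2), then solving (1.11) yields $\bar{\alpha}_i = \nu l \alpha_i$ and $\bar{\rho} = \nu l \rho$, so that $\bar{\alpha}_i / \bar{\rho} = \alpha_i / \rho$.

3.2. Hinge loss function definition

Fig. 2. The Hinge loss (a) and the Ramp loss (b) for One-class SVM.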

Most real-world data sets contain some unexpected data samples, or outliers. The presence of outliers in the training set causes the decision hyperplane of the SVM classifier to shift inappropriately toward the outliers, because the outliers incur the largest margin losses. The extreme sensitivity of SVMs to noise and outliers in the training set is due to the convexity of the Hinge loss function used in the standard SVM [8,22,30]. The Hinge loss function is defined in Eq. (1.12):

$$H_s(z) = \max(0, s - z) \tag{1.12}$$

where the subscript $s$ indicates the position of the Hinge point, and a loss is generated if the score $z$ is smaller than the predefined value $s$ ($s < 1$). Eq. (1.13) presents the primal problem of the standard SVM:

$$\min_{w,b} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{l} H_1\left(y_i f(x_i)\right) \tag{1.13}$$

where $f(x_i) = w^T \phi(x_i) + b$ is the decision function and $\phi(x)$ is the chosen feature map. The SVM classifier is not a good choice for large-scale problems because of its heavy computational cost: the number of support vectors (SVs) increases with the number of samples, so SVM training and recognition times increase quickly [8]. To make SVM more efficient on large data sets, and to build a sparse and robust SVM, non-convex losses have been introduced to SVM [20].

4. Proposed methodology

4.1. Ramp-OCSVM

The effectiveness of Ramp loss based SVMs has been examined in works such as [5,8,20,22,49]. Motivated by them, in this paper the Ramp loss OC-SVM is developed and its performance is examined in the context of the anomaly detection problem. By defining the Hinge loss function as $H_\rho(z) = \max(0, \rho - z)$ and assuming $g(x) = w^T \phi(x)$, the primal model of One-class SVM is equivalent to the following form:

$$\min_{w,\rho} \ \frac{1}{2}\|w\|_2^2 + \frac{1}{\nu l} \sum_{i=1}^{l} H_\rho\left(g(x_i)\right) - \rho \tag{1.14}$$

Consider Fig. 2(a), the Hinge loss of One-class SVM. If a data sample falls above the separating hyperplane, then $g(x) = w^T \phi(x) \ge \rho$, so $H_\rho(z) = 0$ and there is no penalty for this sample. If a data point falls under the separating hyperplane, then $w^T \phi(x) < \rho$, and the penalty grows with the distance of the point from the hyperplane. Consequently, to increase the robustness of One-class SVM and to prevent the outliers from being considered as support vectors, we use the Ramp loss function $R_{\rho,s}(z)$ instead of the Hinge loss function. As shown in Fig. 2(b), for $z \le \rho - s$ the Ramp loss is flat and takes the constant value $s$. Based on this definition, the Ramp loss function for OC-SVM is defined as Eq. (1.15), and Eq. (1.14) can be rewritten as Eq. (1.16):

$$R_{\rho,s}(z) = \begin{cases} 0, & z \ge \rho \\ \rho - z, & \rho - s < z < \rho \\ s, & z \le \rho - s \end{cases} \tag{1.15}$$

$$\min_{w,\rho} \ \frac{1}{2}\|w\|_2^2 + \frac{1}{\nu l} \sum_{i=1}^{l} R_{\rho,s}\left(g(x_i)\right) - \rho \tag{1.16}$$

$R_{\rho,s}(z)$ can be decomposed into the sum of the convex Hinge loss function and a concave function, i.e. $R_{\rho,s}(z) = H_\rho(z) - H_{\rho-s}(z)$. Therefore, Eq. (1.16) of the Ramp loss One-class SVM can be reformulated as:

$$\min_{w,\rho} \ \frac{1}{2}\|w\|_2^2 + \frac{1}{\nu l} \sum_{i=1}^{l} H_\rho\left(g(x_i)\right) - \rho - \frac{1}{\nu l} \sum_{i=1}^{l} H_{\rho-s}\left(g(x_i)\right) \tag{1.17}$$

The objective function of the above optimization problem is the sum of a convex part and a concave part. The convex part is

$$f_{vex}(w, \rho) = \frac{1}{2}\|w\|_2^2 + \frac{1}{\nu l} \sum_{i=1}^{l} H_\rho\left(g(x_i)\right) - \rho \tag{1.18}$$

and the concave part is

$$f_{cov}(w, \rho) = -\frac{1}{\nu l} \sum_{i=1}^{l} H_{\rho-s}\left(g(x_i)\right) \tag{1.19}$$
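The decomposition $R_{\rho,s}(z) = H_\rho(z) - H_{\rho-s}(z)$ can be checked numerically. The following is a minimal sketch in Python; the function names are hypothetical and the values of `rho`, `s` and `z` are arbitrary illustration values, not settings from the paper.

```python
def hinge(z, rho):
    """Hinge loss H_rho(z) = max(0, rho - z)."""
    return max(0.0, rho - z)

def ramp(z, rho, s):
    """Ramp loss of Eq. (1.15), written case by case (assumes s > 0)."""
    if z >= rho:
        return 0.0       # on or above the hyperplane: no penalty
    if z > rho - s:
        return rho - z   # linear part, same as the Hinge loss
    return s             # flat part: the loss is capped at s

def ramp_as_difference(z, rho, s):
    """Ramp loss as convex Hinge part minus shifted Hinge, as used in Eq. (1.17)."""
    return hinge(z, rho) - hinge(z, rho - s)

# An outlier far below the hyperplane: unbounded Hinge loss vs. capped Ramp loss.
rho, s = 1.0, 0.5
hinge_far = hinge(-3.0, rho)   # grows with the distance from the hyperplane
ramp_far = ramp(-3.0, rho, s)  # stays at s
```

The cap is what limits the pull of a distant outlier on the decision boundary: its Hinge loss keeps growing with distance, while its Ramp loss never exceeds `s`.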

4.2. Concave–Convex Procedure (CCCP) for Ramp-OCSVM

Since the formulation of the Ramp loss OC-SVM in Eq. (1.17) consists of a convex part and a concave part, the Concave–Convex Procedure (CCCP) is introduced as an effective approach to solve this optimization problem [8]. The CCCP algorithm is related to the "Difference of Convex" (DC) method, whose introduction dates back to the 1990s [8,20,43]. CCCP is considered a powerful algorithm for solving non-differentiable non-convex optimization problems because it is simple to tune, it solves a convex problem in each iteration, and overall it has a lower computational cost than the convex alternatives [8]. The CCCP framework for the problem (1.17) is constructed as follows (Algorithm 1):


Algorithm 1 CCCP for the problem (1.17).
(1) Initialize $(w_0, \rho_0)$, set $k := 0$.
(2) Construct and solve the problem

$$\min_{w,\rho} \ f_{vex}(w, \rho) + \left\langle f'_{cov}(w_k, \rho_k), (w, \rho) \right\rangle \tag{1.20}$$

Get the solution $(w_k^*, \rho_k^*)$.
(3) If $(w_k^*, \rho_k^*)$ satisfies the convergence condition, the output solution is $(w^*, \rho^*) = (w_k^*, \rho_k^*)$; otherwise, set $k = k + 1$ and go to step (2).

In the rest of this section, the steps to find the solution of Eq. (1.20) are discussed. It should be noted that $f_{cov}(w, \rho)$ is non-differentiable at some points; the CCCP procedure remains valid when using any upper-derivative of a concave function. Here, let $\delta = (\delta_1, \cdots, \delta_l)^T$, where

$$\delta_i = \begin{cases} 1/(l\nu), & \rho - w^T \phi(x_i) > s \\ 0, & \text{otherwise} \end{cases} \tag{1.21}$$

for $i = 1, \ldots, l$. By using Eq. (1.21), Eq. (1.20) can be rewritten as:

$$\min_{w,\xi,\rho} \ \frac{1}{2} w^T w + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i - \rho - \sum_{i=1}^{l} \delta_i \left(\rho - w^T \phi(x_i)\right) \qquad \text{s.t.} \quad w^T \phi(x_i) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l \tag{1.22}$$

We need to derive the dual form of problem (1.22) in order to solve it. Its Lagrangian is given by

$$L = \frac{1}{2} w^T w + \frac{1}{\nu l} \sum_{i=1}^{l} \xi_i - \rho - \sum_{i=1}^{l} \delta_i \left(\rho - w^T \phi(x_i)\right) - \sum_{i=1}^{l} \alpha_i \left(w^T \phi(x_i) - \rho + \xi_i\right) - \sum_{i=1}^{l} \mu_i \xi_i \tag{1.23}$$

where $\alpha = (\alpha_1, \ldots, \alpha_l)$ and $\mu = (\mu_1, \ldots, \mu_l)$ are the Lagrange multiplier vectors, and the Karush–Kuhn–Tucker (KKT) conditions are defined as follows:

$$\nabla_w L = w + \sum_{i=1}^{l} (\delta_i - \alpha_i) \phi(x_i) = 0 \tag{1.24}$$

$$\nabla_\rho L = -1 - \sum_{i=1}^{l} \delta_i + \sum_{i=1}^{l} \alpha_i = 0 \tag{1.25}$$

$$\nabla_{\xi_i} L = \frac{1}{\nu l} - \alpha_i - \mu_i = 0 \tag{1.26}$$

By simplifying formula (1.24), we obtain:

$$w = \sum_{i=1}^{l} (\alpha_i - \delta_i) \phi(x_i) \tag{1.27}$$

Let $\beta = \alpha - \delta$. By putting (1.27) into the Lagrangian (1.23) and using (1.25) and (1.26), we obtain the dual problem as Eq. (1.28):

$$\min_{\beta} \ \frac{1}{2} \beta^T Q \beta \qquad \text{s.t.} \quad -\delta_i \le \beta_i \le \frac{1}{l\nu} - \delta_i, \quad i = 1, \ldots, l, \qquad e^T \beta = 1 \tag{1.28}$$

Based on the aforementioned formulas, the Ramp loss One-class SVM can be constructed with the CCCP procedure as shown in Algorithm 2:

Algorithm 2 CCCP for Ramp loss OC-SVM.
(1) Input the training set $S = \{x_1, \ldots, x_l\}$.
(2) Select a proper value for $s$ in the ramp loss $R_{\rho,s}(z)$ and for $0 < \nu \le 1$.
(3) Solve the problem (1.20).
(3.1) Initialize $\delta_0$, set $k = 1$.
(3.2) Construct and solve the QP model (1.28) in the kth iterative step. Get the solution $\beta_k$, compute $\rho_k$ based on the KKT conditions, and compute $\delta_k$ based on (1.21).
(3.3) If $\delta_k = \delta_{k-1}$, stop the iteration and take the solution $(\beta^*, \rho^*) = (\beta_k, \rho_k)$; otherwise, set $k = k + 1$ and go to step (3.2).
(4) Given a new sample $x$, predict its label with the decision function

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \beta_i K(x_i, x) - \rho \right) \tag{1.29}$$

At each iteration of CCCP (Algorithm 2), the main computational cost comes from solving the QP model (1.28). When SMO is used to solve this sub-model, we know from [7] that its computational complexity is (#Iterations of SMO) × O(nl), where $l$ is the number of samples and $n$ is the dimension. The total complexity of Algorithm 2 is therefore (#Iterations of CCCP) × (#Iterations of SMO) × O(nl).

To obtain a continuous outlier score that carries more information about how strongly a data point should be considered anomalous, we can use Eq. (1.30):

$$\text{Anomaly score} = \frac{1}{\nu l} \sum_{i=1}^{l} R_{\rho,s}\left(\tau_i f(x_i)\right) \tag{1.30}$$

where $f(x_i) = w^T \phi(x_i) - \rho = \sum_{j=1}^{l} \beta_j K(x_j, x_i) - \rho$, and $\tau_i = -1$ if the label of $x_i$ is predicted incorrectly, otherwise $\tau_i = 1$.

5. Experimental design for Ramp-OCSVM

In this part, we describe the experimental design, including the experimental framework, preprocessing, experimental setup, data description, and the performance evaluation metrics. This part ends with a comprehensive discussion of the experimental results.

5.1. Experimental scheme

To measure the performance of the proposed algorithm in different cases of anomaly detection, three experiments with different purposes, experimental settings and performance measurements have been performed, as described in the following:

1. The first experiment is carried out on artificial data in $R^2$. The aim of this experiment is to show the effect of the Ramp loss function on the decision boundary of OC-SVM.
2. The second experiment is designed to assess the performance of Ramp-OCSVM in different one-class classification tasks compared to the original OC-SVM proposed by Schölkopf et al. [41]. This experiment is conducted on selected benchmark data sets taken from the UCI machine learning repository.
3. The third experiment is designed to evaluate the performance of the proposed method in a large-scale network anomaly detection setting. In this case, the efficiency of Ramp-OCSVM is tested on two well-known intrusion detection data sets, NSL-KDD and UNSW-NB15, and some baselines are used for comparison with the proposed method.

5.2. Experimental setup

Fig. 3. The comparison of the decision boundary of Ramp-OCSVM and OC-SVM on the toy example set.

Fig. 4. Evaluation of Ramp-OCSVM performance on butterfly-shaped dataset.

All experiments are performed on an Intel® Core™ i7 CPU @ 3.60 GHz computer with 8.00 GB RAM running Windows 7. The proposed Ramp-OCSVM is implemented using MATLAB 2013, and LIBSVM version 3.20 [7] is utilized as the SVM package. The steps of these experiments are described as follows:

Preprocessing: Since the intrusion detection data sets include both continuous features and nominal variables, they are not compatible with some classifier solvers. Hence, we used 1-of-k coding to convert the nominal variables to a continuous format. In this conversion, k different features are created in place of the k distinct values of a categorical feature [42], and 0 and 1 are used to indicate whether each of the k features corresponds to the categorical value of the record. Moreover, all data points are normalized to zero mean and unit variance. Normalization not only makes the numerical calculation easier, but also prevents attributes with greater numerical ranges from dominating those with smaller numerical ranges [21,29].

Cross-validation: To prevent overfitting and validate the results of our experiments, the k-fold cross-validation technique [40] is used for performance evaluation. With this method, the data set is randomly split into k parts and each of the k parts acts as an independent holdout test set. In each iteration, one part is selected as the test set and the remaining (k−1) parts are used as the training set [56]. Finally, the accuracy of the model is calculated as the average of the accuracies obtained by running the model k times. In all experiments in this research, k is set to 10 because of its low bias, low variance, and good error estimate.
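The preprocessing and cross-validation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' MATLAB code: the function names are hypothetical, the protocol values are made up, and the population variance (division by the number of samples) is assumed for the normalization.

```python
import math
import random

def one_of_k(values):
    """1-of-k coding: one binary feature per distinct value of a nominal column."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories

def standardize(column):
    """Scale a numeric column to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # constant column: avoid division by zero
    return [(v - mean) / std for v in column]

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is in exactly one test fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# Toy usage with made-up values.
encoded, categories = one_of_k(["tcp", "udp", "tcp", "icmp"])
scaled = standardize([1.0, 2.0, 3.0])
splits = list(k_fold_indices(25, k=10))
```

In a one-class setting the model would be fitted on the normal records of each training fold only; that filtering step is omitted here to keep the sketch short.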

Parameter setting: The RBF kernel $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$ is used to map the data points into a higher dimensional feature space. The parameters of the model are the penalty factor $\nu$, the RBF kernel parameter $\gamma$, and the Ramp loss parameter $s$. Their values are determined by grid search with 10-fold cross-validation. The grid search for $\nu$ and $\gamma$ was done on $[10^{-3}, 10^{-2}, \ldots, 10^{3}]$ and $[10^{-2}, 10^{-1}, \ldots, 10^{2}]$, respectively. The Ramp loss parameter $s$ controls the sensitivity of the model to outliers. A greater value of this parameter decreases the effect of the Ramp loss, so that Ramp-OCSVM acts like the original OC-SVM. In fact, if $s \to +\infty$, then $R_{\rho,s} \to H_\rho$. In this research, $s \in (0, 5]$ is chosen as the search range for $s$.

5.3. Performance evaluation metrics

One of the commonly used measures for evaluating classification techniques is the Receiver Operating Characteristic (ROC) curve, which shows the relationship between the False Positive Rate and the True Positive Rate. However, as discussed by Davis and Goadrich [10], for a highly skewed data set the area under the Precision-Recall curve (PR-AUC) is a better choice for evaluating the model. The main visual difference between the two is that the goal in ROC space is to be in the upper-left-hand corner, whereas the goal in PR space is to be in the upper-right-hand corner. It should be mentioned that algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve; however, an algorithm that dominates in PR space also dominates in ROC space [10]. By considering the confusion matrix, four possible situations can be defined as follows:

• True positive (TP): an attack record correctly labeled as “Attack”.
• True negative (TN): a normal record correctly labeled as “Normal”.


Table 1
Description of chosen UCI benchmark data sets for anomaly detection.

| Data set | # of records | # of features | # of classes | Outlier class(es) | # of outliers | Resized training set | Resized test set |
|---|---|---|---|---|---|---|---|
| Ionosphere | 351 | 33 | 2 | b | 126 | 164 | 86 |
| Shuttle | 58,000 | 9 | 7 | 2, 3, 5, 6 | 2633 | 38,889 | 17,005 |
| Breast_cancer | 569 | 30 | 2 | M | 211 | 261 | 139 |
| Satellite | 6435 | 36 | 7 | 2, 4, 5 | 1364 | 3618 | 1726 |

Table 2
Details of the NSL-KDD data set for anomaly detection purpose.

| Class | Traffic type | Full NSL-KDD training set: # of records | Full NSL-KDD training set: % of frequency | 20% NSL-KDD train set: # of records | 20% NSL-KDD train set: resized train set | 20% NSL-KDD test set: # of records | 20% NSL-KDD test set: resized test set |
|---|---|---|---|---|---|---|---|
| 1 | Normal | 67,341 | 53.46 | 13,449 | 13,449 | 2152 | 2152 |
| 2 | Abnormal | 58,630 | 46.54 | 11,743 | 1174 | 9698 | 970 |
| Total records | | 125,971 | 100 | 25,192 | 14,623 | 11,850 | 3122 |

• False negative (FN): an attack record wrongly labeled as “Normal”.
• False positive (FP): a normal record wrongly labeled as “Attack”.
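The four outcomes listed above can be counted directly from paired true and predicted labels. The snippet below is a small sketch with made-up labels; the function name `confusion_counts` and the sample data are illustrative only and do not come from the paper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="Attack"):
    """Count TP, TN, FP and FN, treating `positive` as the anomaly label."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            # Predicted "Attack": correct if the record really is an attack.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted "Normal": a miss if the record really is an attack.
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Made-up example labels, for illustration only.
y_true = ["Attack", "Attack", "Normal", "Normal", "Normal"]
y_pred = ["Attack", "Normal", "Normal", "Attack", "Normal"]
counts = confusion_counts(y_true, y_pred)
print(dict(counts))
```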

According to the above definitions, the most common metrics for evaluating the performance of an anomaly detection algorithm, namely recall, precision and F1 score, are formulated as follows:

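\[ \mathrm{Recall} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FN_i)} \tag{1.31} \]

\[ \mathrm{Precision} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)} \tag{1.32} \]

\[ \mathrm{F1\ score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{1.33} \]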
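Eqs. (1.31)–(1.33) translate almost line by line into code. The sketch below assumes that the index i runs over the C classes and that the per-class counts are given as lists; the function names and the counts are illustrative and are not taken from the paper's experiments.

```python
def micro_recall(tp, fn):
    """Eq. (1.31): summed true positives over summed actual positives."""
    return sum(tp) / (sum(tp) + sum(fn))

def micro_precision(tp, fp):
    """Eq. (1.32): summed true positives over summed predicted positives."""
    return sum(tp) / (sum(tp) + sum(fp))

def f1_score(precision, recall):
    """Eq. (1.33): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative per-class counts (not from the paper), one entry per class i = 1..C.
tp = [90, 40]
fn = [10, 10]
fp = [5, 20]

recall = micro_recall(tp, fn)        # 130 / 150
precision = micro_precision(tp, fp)  # 130 / 155
f1 = f1_score(precision, recall)
print(f"recall={recall:.4f} precision={precision:.4f} f1={f1:.4f}")
```

A production version would also need to guard against empty denominators, for example when a detector never predicts the positive class.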

6. Experimental results and discussions

6.1. Experimental results with artificial data

In this part, two two-dimensional (2D) toy examples are used as artificial data sets to show the influence of the Ramp loss function on the decision boundary of Ramp-OCSVM, and to investigate how the parameter settings affect the solutions. Consider Fig. 3 as the first toy example. This artificial data set was created in MATLAB and contains 140 data samples, all labeled as the normal class, with some noise, plotted in the domain $[-5, 5]^2$. To generate the same data, the file shared in the repository https://github.com/smhbamakan/Generate-data can be used.

Fig. 3 compares the performance of Ramp-OCSVM with that of the original OC-SVM on this artificial data. The panels on the left show the results of Ramp-OCSVM, and the panels on the right show the results of OC-SVM. Data points that are support vectors are marked by solid red circles. The experiments were run with different values of ν, γ and s. As discussed by Schölkopf et al. [41], the parameter ν is introduced to control the ratio of outliers in the training set. Comparing Fig. 3(b) and (d) shows that, with the other parameters fixed, a larger value of ν (ν = 0.5 in Fig. 3(d)) keeps the decision boundary of OC-SVM from shifting toward the data points inserted as outliers. Nevertheless, the model cannot perfectly recover the shape of the data samples in this toy example. Comparing Fig. 3(a) and (b) reveals that, with the same parameter values ν = 0.1 and γ = 0.1, the outliers have a lower impact on the decision boundary of Ramp-OCSVM in Fig. 3(a) because of the parameter s. Even though the parameters in Fig. 3(a) and (b) are not well tuned, Ramp-OCSVM is more robust to the outliers than OC-SVM. After changing the kernel width in Fig. 3(e) and (f), both models can distinguish the geometric shape of the data points, with the difference that the outliers in Fig. 3(f) cannot be ignored, whereas the performance of Ramp-OCSVM in Fig. 3(e) is completely satisfactory. It can be concluded that, with a smaller value of ν, OC-SVM cannot ignore the impact of outliers, so its decision boundary shifts toward the noise and outliers in the training set. In Ramp-OCSVM, by contrast, the influence of outliers can be controlled by the parameter s.

To further analyze the parameters of the proposed model, another experiment was conducted on the second toy example. This artificial data set, shown in Fig. 4, contains 267 data points. As expected for data containing noise and outliers, the results in Fig. 4 show that Ramp-OCSVM performs better than OC-SVM. Fig. 4(a) and (c) belong to Ramp-OCSVM, and Fig. 4(b) and (d) show the results of OC-SVM. Fig. 4 shows that choosing a smaller value of γ in both models increases the number of SVs and consequently decreases the generalization power of the models. When γ is increased from 0.1 to 0.5 in Fig. 4(c) and (d), the number of SVs drops compared to Fig. 4(a) and (b), respectively. Finally, Fig. 4(c), the result of Ramp-OCSVM with ν = 0.1, γ = 0.5 and s = 0.005, shows robust performance in covering most of the data samples in this butterfly-shaped data set.

6.2. Experiments on UCI repository data sets

To evaluate the performance of the proposed method in detecting anomalous points in general cases, and to provide a better comparison framework, the same UCI benchmark data sets¹ are selected as those chosen in [1].
The details of the selected data sets are provided in Table 1. Furthermore, the results obtained by our model are compared against the original OC-SVM and the two extensions of this classifier proposed by Amer et al. [1], abbreviated ROCSVM and eta OCSVM. As shown in Table 1, the selected data sets are high dimensional and have different numbers of classes. Since these data sets were originally designed for classification tasks, some preprocessing steps are needed to prepare them for anomaly detection. The training set is built by randomly selecting 70% of the normal classes and 5% of the outlier classes, whose class labels are changed to the normal class. The test set consists of the remaining 30% of the classes considered normal and a randomly sampled 15% of the outlier classes.

¹ https://archive.ics.uci.edu/ml/datasets.html

6.3. Experimental results and discussion with some UCI repositories

Fig. 5 compares the performance of Ramp-OCSVM and OC-SVM based on the PR curve for each data set. These results show the superiority of Ramp-OCSVM over the original model on three data sets: ionosphere, shuttle and satellite. The areas under the PR curve obtained by Ramp-OCSVM are 0.9827, 0.9952 and 0.8916 for the ionosphere, shuttle and satellite data sets, respectively, while OC-SVM obtained 0.9138, 0.9769 and 0.8395 for the same data sets. On the breast cancer data set, however, OC-SVM performs better, achieving an area under the PR curve of 86.56% compared to 82.13% for Ramp-OCSVM. From Fig. 5 we can conclude that the proposed model reduces the negative impact of unexpected outliers in the training set, which supports the robustness of Ramp-OCSVM compared to OC-SVM. Overall, the proposed Ramp-OCSVM obtains better results in most cases based on the PR-AUC indicator.

Fig. 5. Comparison of the PR-AUC obtained by Ramp-OCSVM and OC-SVM on some UCI repository data sets.

Table 3
Details of the chosen 20% UNSW-NB15 subset for anomaly detection purpose.

| Class | Traffic type | 20% train set: # of records | 20% train set: % frequency | 20% test set: # of records | 20% test set: % frequency |
|---|---|---|---|---|---|
| 1 | Normal | 11,203 | 0.3182 | 7433 | 0.4498 |
| 2 | Analysis | 8129 | 0.2309 | 3867 | 0.2340 |
| 3 | Backdoor | 6672 | 0.1895 | 2208 | 0.1336 |
| 4 | DoS | 3617 | 0.1027 | 1208 | 0.0731 |
| 5 | Exploits | 2410 | 0.0685 | 791 | 0.0479 |
| 6 | Fuzzers | 2130 | 0.0605 | 680 | 0.0411 |
| 7 | Generic | 401 | 0.0114 | 138 | 0.0084 |
| 8 | Reconnaissance | 366 | 0.0104 | 131 | 0.0079 |
| 9 | Shellcode | 244 | 0.0069 | 62 | 0.0038 |
| 10 | Worms | 34 | 0.0010 | 8 | 0.0005 |
| Total records | | 35,206 | | 16,526 | |

| | Train set: # of records | Resized train set | Test set: # of records | Resized test set |
|---|---|---|---|---|
| Normal class | 11,203 | 11,203 | 7433 | 7433 |
| Anomaly class | 24,003 | 2400 | 9093 | 909 |
| Total records | 35,206 | 13,603 | 16,526 | 8342 |

Table 4
Per-class comparison with NSL-KDD based on Recall, Precision and F1 score.

| Method | Normal class: %Recall | Normal class: %Precision | Normal class: %F1 score | Anomaly class: %Recall | Anomaly class: %Precision | Anomaly class: %F1 score |
|---|---|---|---|---|---|---|
| OC-SVM | 95.49 | 98.09 | 96.77 | 95.88 | 90.56 | 93.14 |
| ROCSVM | 94.10 | 97.59 | 95.81 | 94.85 | 87.87 | 91.22 |
| eta OCSVM | 96.87 | 98.29 | 97.58 | 96.33 | 93.37 | 94.83 |
| Ramp-OCSVM | 98.75 | 99.21 | 98.98 | 98.25 | 97.24 | 97.74 |

Table 5
Per-class comparison with UNSW-NB15 based on Recall, Precision and F1 score.

| Method | Normal class: %Recall | Normal class: %Precision | Normal class: %F1 score | Anomaly class: %Recall | Anomaly class: %Precision | Anomaly class: %F1 score |
|---|---|---|---|---|---|---|
| OC-SVM | 96.54 | 98.31 | 97.42 | 86.47 | 75.36 | 80.53 |
| ROCSVM | 94.66 | 97.47 | 96.04 | 79.87 | 64.65 | 71.46 |
| eta OCSVM | 95.74 | 97.76 | 96.74 | 82.07 | 70.18 | 75.66 |
| Ramp-OCSVM | 97.75 | 99.14 | 98.44 | 93.07 | 83.51 | 88.03 |

6.4. Experiments on computer network anomaly detection data sets

NSL-KDD data set²: Here we give a brief summary of this data set and the details of the randomly chosen subset used in this section. The NSL-KDD data set consists of 125,971 connection records and, like the KDD99 data set, contains 41 attributes, including both continuous and categorical features. For anomaly detection, the class label of each record indicates whether a particular connection is normal or abnormal. The attack label is one of 24 attack types falling into four categories, Denial of Service (DoS), Probe, User to Root (U2R) and Remote to Local (R2L), all of which are treated as the anomalous class. The details of the NSL-KDD data set for our anomaly detection purpose are given in Table 2 [2,37,45].

UNSW-NB15 data set³: This is a recently published data set for evaluating intrusion detection techniques. It was prepared by Moustafa and Slay in 2015 [34] in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). The main purpose of preparing this data set was to cover more modern network traffic patterns, with a wide variety of low-footprint intrusions and in-depth structured information about the network traffic. Although the full data set contains 2,540,044 data samples, for computational efficiency the authors [34] prepared a training subset and a testing subset, which contain 175,341 and 82,332 records, respectively.
The details of these data sets for the purpose of anomaly detection are provided in Table 3.

² http://www.unb.ca/cic/datasets/nsl.html
³ https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/

6.5. Experimental results and discussion with network anomaly detection data sets

In this section, we present the experimental results on the network anomaly detection data sets. To further examine the performance of the proposed algorithm, the results of our model are compared with those obtained by the original OC-SVM, ROCSVM and eta OCSVM, the latter two of which were proposed by Amer et al. [1]. The results of these four algorithms on NSL-KDD and UNSW-NB15 are presented in Tables 4 and 5, respectively. These results are based on the recall, precision and F1 score measures for each class of both data sets. From Tables 4 and 5, we can see that Ramp-OCSVM outperforms OC-SVM, ROCSVM and eta OCSVM on both network anomaly data sets. On the NSL-KDD data set, eta OCSVM takes second place on these three measures, whereas on the UNSW-NB15 data set OC-SVM gets better results than ROCSVM and eta OCSVM. A closer look at Table 5 shows that the recall, precision and F1 score for the anomaly class in UNSW-NB15 are lower than those obtained for the anomaly class in NSL-KDD. This may be because of the diversity of anomaly types in UNSW-NB15: its anomaly class was generated by random sampling from 9 groups of attacks, while the anomaly class in NSL-KDD was sampled from 4 types of attacks.

In addition to the per-class comparison, the total accuracy (ACC), total detection rate (DR) and total false alarm rate (FAR) are used to compare the overall performance of Ramp-OCSVM with the other forms of one-class SVM on NSL-KDD and UNSW-NB15. The detection results of these algorithms are shown in Table 6. From Table 6, we can see that the proposed Ramp-OCSVM obtains the best accuracy on both data sets. The experimental results on NSL-KDD show the superiority of Ramp-OCSVM over the other algorithms, with an accuracy of 98.59%, a total detection rate of 98.25% and a false alarm rate of 1.25%. On UNSW-NB15, values of 97.24%, 93.07% and 2.25% are obtained for the total accuracy, detection rate and false alarm rate, respectively. The experimental results on the UNSW-NB15 data set show that OC-SVM obtains better results than ROCSVM and eta OCSVM.

Table 6
The overall performance comparison of one-class SVMs on NSL-KDD and UNSW-NB15.

| Algorithm | NSL-KDD: ACC | NSL-KDD: DR | NSL-KDD: FAR | UNSW-NB15: ACC | UNSW-NB15: DR | UNSW-NB15: FAR |
|---|---|---|---|---|---|---|
| OC-SVM | 95.61 | 95.88 | 4.51 | 95.44 | 86.47 | 3.46 |
| ROCSVM | 94.33 | 94.85 | 5.90 | 93.05 | 79.87 | 5.34 |
| eta OCSVM | 96.70 | 96.33 | 3.13 | 94.25 | 82.07 | 4.26 |
| Ramp-OCSVM | 98.59 | 98.25 | 1.25 | 97.24 | 93.07 | 2.25 |
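The overall figures in Table 6 can be cross-checked against the per-class recalls in Tables 4 and 5. The sketch below rests on assumptions that this section does not spell out: that DR is the recall of the anomaly class, that FAR is 100 minus the recall of the normal class, and that ACC is the recall average weighted by the resized test set sizes in Tables 2 and 3. The function name `overall_metrics` is hypothetical.

```python
def overall_metrics(recall_normal, recall_anomaly, n_normal, n_anomaly):
    """Return (ACC, DR, FAR) in percent from per-class recalls given in percent."""
    dr = recall_anomaly                      # detection rate: anomalies caught
    far = 100.0 - recall_normal              # false alarms: normals flagged
    acc = (recall_normal * n_normal + recall_anomaly * n_anomaly) / (n_normal + n_anomaly)
    return round(acc, 2), round(dr, 2), round(far, 2)

# Per-class recalls (normal, anomaly) from Table 4; test sizes from Table 2.
nsl_kdd = {
    "OC-SVM": (95.49, 95.88),
    "ROCSVM": (94.10, 94.85),
    "eta OCSVM": (96.87, 96.33),
    "Ramp-OCSVM": (98.75, 98.25),
}
for name, (rec_normal, rec_anomaly) in nsl_kdd.items():
    print(name, overall_metrics(rec_normal, rec_anomaly, 2152, 970))
```

Under these assumptions, the same call with the Table 5 recalls and the UNSW-NB15 resized test sizes (7433 normal and 909 anomalous records) gives the UNSW-NB15 columns of Table 6.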
On the NSL-KDD data set, eta OCSVM shows results relatively comparable to those of our model. Furthermore, the average classification error rates are shown in Fig. 6 to provide a better comparison metric for the detection performance of the selected algorithms. Based on the results in Fig. 6, the proposed Ramp-OCSVM shows the best performance among the compared methods, with average classification errors of 1.41% and 2.76% on the NSL-KDD and UNSW-NB15 data sets, respectively. ROCSVM, on the other hand, gives the worst results, with classification error rates of 5.67% on NSL-KDD and 6.95% on UNSW-NB15. Overall, the order of the methods by classification error on the NSL-KDD data set is Ramp-OCSVM, eta OCSVM, OC-SVM and finally ROCSVM, while on the UNSW-NB15 data set the order is Ramp-OCSVM, OC-SVM, eta OCSVM and finally ROCSVM.

Fig. 6. The average classification error rates of compared methods.

It is worthwhile to compare the performance of the proposed method with some of the existing network anomaly detection methods. It should be noted that such a comparison is more valuable and meaningful when it is comprehensive and considers a wide range of criteria, such as the size of the training and testing data sets, the sampling methods, the preprocessing techniques (especially the number of selected features), the computational costs and so on. In Table 7, however, we have tried to cover the most recently published methods for which both the true positive rate and the false positive rate are provided. Based on the results shown in Table 7, the proposed method gives acceptable results compared to the existing approaches.

Table 7
Comparison between some of the existing network anomaly detection methods.

| Anomaly detection methods | Accuracy % | Detection rate % | False alarm rate % |
|---|---|---|---|
| Random effects logistic regression [33] | 98.68 (c) | – | – |
| The growing hierarchical self organizing map (GHSOM) [23] | 99.63 (a) | 94.04 | 1.8 (c) |
| U-BRAIN [9] | 94.1 | 89.3 | – |
| N-KPCA-GA-SVM [28] | – | 95.26 | 1.03 (a) |
| Online sequential extreme learning machine [42] | 98.66 (b) | 99.01 (a) | 1.74 |
| Clustering based on Self-Organized Ant Colony Network (CSOACN) [15] | | 94.86 | 6.01 |
| Ensembled technique [16] | 96.28 | 98.98 (b) | 2.28 |
| Fuzzy Clustering Based ANN [38] | – | 98 | 3.05 |
| The proposed method | 98.59 | 98.25 (c) | 1.25 (b) |

(a) First rank. (b) Second rank. (c) Third rank.

7.1. Summary and conclusion

Anomaly detection can be considered as the task of identifying unexpected data points, items or events that do not match the patterns of the other observations in a data set, and it has many applications in the data mining field. Although a wide range of techniques has been proposed to address this problem, boundary-based methods, including the One-class SVM, have attracted the attention of many researchers because of their unsupervised nature and strong generalization power. However, this classifier is severely sensitive to noise and outliers in the training set, which cause the decision boundary of OC-SVM to shift toward the outlier points. To cope with this problem, we developed a robust OC-SVM by introducing the Ramp loss function in place of the Hinge loss. The resulting Ramp-OCSVM is a robust and sparse anomaly detection method. We evaluated the performance of the proposed Ramp-OCSVM by conducting three different experiments. The first experiment was performed on two-dimensional artificial data sets with different geometric shapes. The second experiment was conducted on some data sets from the UCI repository to evaluate the effectiveness of our method in general anomaly detection cases. Finally, the NSL-KDD and UNSW-NB15 data sets were used as two network anomaly detection data sets to evaluate the effectiveness of our model. Based on the experimental results, the proposed Ramp-OCSVM outperformed the existing models in terms of robustness to outliers, generalization power and the detection of anomalies.

Acknowledgments

This work has been partially supported by the CAS-TWAS President’s Fellowship for International Ph.D. Students and China Postdoctoral Science Foundation Grant (No. 2018M633186).

References

[1] M. Amer, M. Goldstein, S. Abdennadher, Enhancing one-class support vector machines for unsupervised anomaly detection, in: Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, 2013.
[2] S.M.H. Bamakan, H. Wang, T. Yingjie, Y. Shi, An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization, Neurocomputing 199 (2016) 90–102.


[3] M. Bicego, M.A. Figueiredo, Soft clustering using weighted one-class support vector machines, Pattern Recognit. 42 (1) (2009) 27–32.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1–122.
[5] J.P. Brooks, Support vector machines with the ramp loss and the hard margin loss, Oper. Res. 59 (2) (2011) 467–479.
[6] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. (CSUR) 41 (3) (2009) 15.
[7] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27:21–27:27.
[8] R. Collobert, F. Sinz, J. Weston, L. Bottou, Trading convexity for scalability, in: Proceedings of the Twenty-third International Conference on Machine Learning, 2006.
[9] G. D’angelo, F. Palmieri, M. Ficco, S. Rampone, An uncertainty-managing batch relevance-based approach to network anomaly detection, Appl. Soft Comput. 36 (2015) 408–418.
[10] J. Davis, M. Goadrich, The relationship between Precision-Recall and ROC curves, in: Proceedings of the Twenty-third International Conference on Machine Learning, 2006.
[11] L.I. Dong, S. Liu, H. Zhang, A method of anomaly detection and fault diagnosis with online adaptive learning under small training samples, Pattern Recognit. 64 (2017) 374–385. http://dx.doi.org/10.1016/j.patcog.2016.11.026.
[12] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Vol. 2, Wiley, New York, 1973.
[13] F. Edgeworth, XLI. On discordant observations, Lond. Edinburgh Dublin Philosoph. Mag. J. Sci. 23 (143) (1887) 364–375.
[14] S.M. Erfani, S. Rajasegarar, S. Karunasekera, C. Leckie, High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning, Pattern Recognit. 58 (2016) 121–134.
[15] W. Feng, Q. Zhang, G. Hu, J.X. Huang, Mining network data for intrusion detection through combining SVMs with ant colony networks, Future Gener. Comput. Syst. 37 (2014) 127–140.
[16] S. Garg, S. Batra, A novel ensembled technique for anomaly detection, Int. J. Commun. Syst. (2016).
[17] G. Giacinto, R. Perdisci, M. Del Rio, F. Roli, Intrusion detection in computer networks by a modular ensemble of one-class classifiers, Inf. Fus. 9 (1) (2008) 69–82.
[18] G.L. Grinblat, L.C. Uzal, P.M. Granitto, Abrupt change detection with one-class time-adaptive support vector machines, Expert Syst. Appl. 40 (18) (2013) 7242–7249.
[19] Z. Gu, Z. Zhang, J. Sun, B. Li, Robust image recognition by L1-norm twin-projection support vector machine, Neurocomputing 223 (2017) 1–11.
[20] S.M. Hosseini Bamakan, H. Wang, Y. Shi, Ramp loss K-Support vector classification-regression; a robust and sparse multi-class approach to the intrusion detection problem, Knowl. Based Syst. 126 (2017) 113–126. https://doi.org/10.1016/j.knosys.2017.03.012.
[21] C.-L. Huang, J.-F. Dun, A distributed PSO–SVM hybrid system with feature selection and parameter optimization, Appl. Soft Comput. 8 (4) (2008) 1381–1391.
[22] X. Huang, L. Shi, J.A. Suykens, Ramp loss linear programming support vector machine, J. Mach. Learn. Res. 15 (1) (2014) 2185–2211.
[23] D. Ippoliti, X. Zhou, A-GHSOM: an adaptive growing hierarchical self organizing map for network anomaly detection, J. Parallel Distrib. Comput. 72 (12) (2012) 1576–1590.
[24] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc, 1988.
[25] A. Karami, M. Guerrero-Zapata, A fuzzy anomaly detection system based on hybrid pso-kmeans algorithm in content-centric networks, Neurocomputing 149 (2015) 1253–1269.
[26] W. Khreich, B. Khosravifar, A. Hamou-Lhadj, C. Talhi, An anomaly detection system based on variable N-gram features and one-class SVM, Inf. Softw. Technol. 91 (2017) 186–197. https://doi.org/10.1016/j.infsof.2017.07.009.
[27] G. Kim, S. Lee, S. Kim, A novel hybrid intrusion detection method integrating anomaly detection with misuse detection, Expert Syst. Appl. 41 (4) (2014) 1690–1700.
[28] F. Kuang, W. Xu, S. Zhang, A novel hybrid KPCA and SVM with GA model for intrusion detection, Appl. Soft Comput. 18 (2014) 178–184.
[29] S.-W. Lin, K.-C. Ying, S.-C. Chen, Z.-J. Lee, Particle swarm optimization for parameter determination and feature selection of support vector machines, Expert Syst. Appl. 35 (4) (2008) 1817–1824. http://dx.doi.org/10.1016/j.eswa.2007.08.088.
[30] D. Liu, Y. Tian, Y. Shi, Ramp loss nonparallel support vector machine for pattern classification, Knowl. Based Syst. (2015).
[31] Y. Liu, B. Zhang, B. Chen, Y. Yang, Robust solutions to fuzzy one-class support vector machine, Pattern Recognit. Lett. 71 (2016) 73–77. https://doi.org/10.1016/j.patrec.2015.12.014.
[32] M.V. Mahoney, P.K. Chan, An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection, in: Proceedings of the Recent Advances in Intrusion Detection, 2003.
[33] M.S. Mok, S.Y. Sohn, Y.H. Ju, Random effects logistic regression model for anomaly detection, Expert Syst. Appl. 37 (10) (2010) 7162–7166.
[34] N. Moustafa, J. Slay, UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set), in: Proceedings of the Military Communications and Information Systems Conference (MilCIS), 2015.

[35] N. Moustafa, J. Slay, The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set, Inf. Secur. J. A Global Perspect. (2016).
[36] K. Nian, H. Zhang, A. Tayal, T. Coleman, Y. Li, Auto insurance fraud detection using unsupervised spectral ranking for anomaly, J. Finance Data Sci. 2 (1) (2016) 58–75.
[37] NSL-KDD. (2016). http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDDdataset.html.
[38] N. Pandeeswari, G. Kumar, Anomaly detection system in cloud environment using fuzzy clustering based ANN, Mob. Netw. Appl. 21 (3) (2016) 494–505.
[39] M. Salem, Adaptive Real-time Anomaly-Based Intrusion Detection Using Data Mining and Machine Learning Techniques, Univ., Diss., Kassel, 2014.
[40] S.L. Salzberg, On comparing classifiers: Pitfalls to avoid and a recommended approach, Data Mining Knowl. Discov. 1 (3) (1997) 317–328.
[41] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) 1443–1471.
[42] R. Singh, H. Kumar, R. Singla, An intrusion detection system using network traffic profiling and online sequential extreme learning machine, Expert Syst. Appl. 42 (22) (2015) 8609–8624.
[43] P.D. Tao, L.T.H. An, A DC optimization algorithm for solving the trust-region subproblem, SIAM J. Optim. 8 (2) (1998) 476–505.
[44] D.M. Tax, R.P. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[45] The KDD99 Dataset. (1998). Retrieved April 15, 2015, from http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[46] C.-H. Tsang, S. Kwong, H. Wang, Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection, Pattern Recognit. 40 (9) (2007) 2373–2391.
[47] H. Wang, J. Miao, S.M.H. Bamakan, L. Niu, Y. Shi, Large-scale Nonparallel Support Vector Ordinal Regression Solver, Procedia Comput. Sci. 108 (2017) 1261–1270.
[48] J. West, M. Bhattacharya, Intelligent financial fraud detection: a comprehensive review, Comput. Secur. 57 (2016) 47–66.
[49] Y. Wu, Y. Liu, Robust truncated hinge loss support vector machines, J. Am. Stat. Assoc. 102 (479) (2007) 974–983.
[50] Y. Xiao, H. Wang, W. Xu, Ramp Loss based robust one-class SVM, Pattern Recognit. Lett. 85 (2017) 15–20.
[51] A.R. Yan, B.Q. Ye, C.L. Zhang, D.N. Ye, E.X. Shu, A feature selection method for projection twin support vector machine, Neural Process. Lett. (2017) 1–18.
[52] H. Yan, Q. Ye, D.-J. Yu, X. Yuan, Y. Xu, L. Fu, Least squares twin bounded support vector machines based on L1-norm distance metric for classification, Pattern Recognit. 74 (2018) 434–447.
[53] Q. Ye, J. Yang, T. Yin, Z. Zhang, Can the virtual labels obtained by traditional LP approaches be well encoded in WLR? IEEE Trans. Neural Netw. Learn. Syst. 27 (7) (2016) 1591–1598.
[54] S. Yin, X. Zhu, C. Jing, Fault detection based on a robust one class support vector machine, Neurocomputing 145 (2014) 263–268.
[55] L. Zhang, J. Lin, R. Karim, An angle-based subspace anomaly detection approach to high-dimensional data: With an application to industrial fault detection, Reliab. Eng. Syst. Saf. 142 (2015) 482–497. http://dx.doi.org/10.1016/j.ress.2015.05.025.
[56] Z. Zhang, T.W. Chow, Maximum margin multisurface support tensor machines with application to image classification and segmentation, Expert Syst. Appl. 39 (1) (2012) 849–860.
[57] F. Zhu, J. Yang, C. Gao, S. Xu, N. Ye, T. Yin, A weighted one-class support vector machine, Neurocomputing 189 (2016) 1–10.

Tian Yingjie received the first degree in mathematics in 1994, the master’s degree in applied mathematics in 1997, and the Ph.D. degree in management science and engineering. He is currently a Professor with the Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China. He has published four books about support vector machines (SVMs), one of which has been cited over 1000 times. His current research interests include SVMs, optimization theory and applications, data mining, intelligent knowledge management, and risk management.

Mahboubeh Mirzabagheri is currently a Ph.D. student with the School of Economics and Management, University of Chinese Academy of Sciences, Beijing, China. She is also studying at the Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China. Her current research interests include sentiment analysis, social network analysis and data mining.


Seyed Mojtaba Hosseini Bamakan received his Ph.D. degree in Management Science and Engineering from the University of Chinese Academy of Sciences (UCAS) in 2017, and his master's degree in IT management from Allameh Tabataba’i University (ATU), Iran, in 2009. He is currently a postdoctoral scholar at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China. His current research interests include information security, business intelligence, big data mining and intelligent optimization techniques.

Huadong Wang received the Ph.D. degree from the College of Mathematical Science, University of Chinese Academy of Sciences, Beijing, China, in 2017. He also studied at the Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China. His current research interests include support vector machines, optimization theory and applications, and data mining.

Qiang Qu is an associate professor and the executive director of the Global Center for Big Mobile Intelligence at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences (CAS). He is a candidate of the CAS Pioneer Hundred Talents Program. He received the M.Sc. degree in computer science from Peking University and the Ph.D. degree from Aarhus University. His current research interests include large-scale data management and mining.
