GP-ELM-RNN: Garson-pruned extreme learning machine based replicator neural network for anomaly detection


Adeel Shiraz Hashmi (corresponding author), Tanvir Ahmad
Department of Computer Engineering, Faculty of Engineering, Jamia Millia Islamia, New Delhi, India

Article history: Received 21 May 2019; Revised 13 August 2019; Accepted 13 September 2019; Available online xxxx

Keywords: Anomaly detection, Big data, Garson algorithm, Replicator neural network, Extreme learning machine

Abstract: Replicator Neural Network (RNN) is a popular algorithm for anomaly detection, but finding the optimal number of hidden layers, and then the optimal number of neurons in each hidden layer, is a challenging and time-consuming task. Extreme Learning Machines (ELM) are single-hidden-layer neural networks whose learning algorithm is different from, and faster than, back-propagation. An ELM-based RNN therefore removes the need to determine the number of hidden layers, and its learning algorithm is faster than that of a gradient-descent-based RNN. The problem of identifying the optimal number of neurons in the hidden layer can be solved by the Garson algorithm. In this work, the authors propose an optimal Replicator Neural Network for anomaly detection, optimized using ELM learning and the Garson algorithm. The experimental results show that the proposed method is fast as well as highly accurate. © 2019 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Anomaly detection deals with identifying those few instances in a dataset which differ from the vast majority of other instances. It finds application in areas like fraud detection, intrusion detection, cancer detection, malicious node detection, etc. (Chandola et al., 2009). Most anomaly detection algorithms belong to the category of unsupervised machine learning, as supervised learning algorithms fail to generate a proper classification model for data with a disproportionate ratio of class instances. Surveying the unsupervised learning algorithms for anomaly detection, most are found to rely on a distance-based approach. However, the distance-based approach is not suitable for large datasets even on platforms like Hadoop, Spark, etc., because computing distances between millions of rows and millions of other rows is not feasible in terms of either time or space complexity.


Replicator Neural Networks (RNN) (Hecht-Nielsen, 1995; Hawkins et al., 2002) are a semi-supervised, neural-network-based approach to anomaly detection that does not make use of distance vectors. Hence, RNN is a feasible option when dealing with large datasets. Extreme Learning Machine (ELM) (Huang et al., 2004) is an alternative learning technique to back-propagation in a single-hidden-layer feed-forward network (SLFN). The advantage of ELM over a back-propagation-trained SLFN is speed: the weights of only one layer (the output layer) are learned, instead of both layers. The back-propagation algorithm is based on gradient descent, so its learning is slow; ELM learning does not use gradient descent and employs a faster matrix-based mechanism instead.

This paper proposes an optimized ELM-RNN-based approach for anomaly detection in large datasets. The significance of the proposed algorithm is that it is multiple times faster than the state-of-the-art algorithms for anomaly detection and it scales to large datasets. This has been achieved by optimizing the Replicator Neural Network at various levels, viz. the learning algorithm, the parallel environment, and the ability to run on GPU/TPU. The paper is organized as follows: Section 2 surveys anomaly detection algorithms as well as the RNN and ELM algorithms; Section 3 presents the proposed Garson-pruned ELM-RNN solution for anomaly detection; Section 4 deals with experiments and results; and Section 5 gives the conclusion and discusses future directions.


2. Literature review

In this section, a survey of anomaly detection techniques is presented, followed by a discussion of RNN, and then the mathematical formulation of ELM.

2.1. Algorithms for anomaly detection

Anomaly detection techniques can be broadly divided into three categories: visual/graphical, statistical, and machine learning. Visual techniques include scatter-plots, which can help discover anomalies in 2D datasets, but for large and multi-dimensional datasets enhanced techniques such as sparse traversal are needed (Bryson et al., 1999). A statistics-based definition of an outlier is: "an outlier is any data point more than 1.5 interquartile ranges (IQR) below the first quartile or above the third quartile"; based on this definition, box-plots can be used for outlier detection (Tukey, 1977). Other statistical measures like frequencies (Koufakou et al., 2008) and entropy (Liu et al., 2013) have also been used in the literature for anomaly detection.

Machine learning based algorithms for anomaly detection can be divided into three categories: supervised, semi-supervised and unsupervised. Supervised anomaly detection is analogous to a binary classification problem, in which a training dataset is provided containing points labeled as normal or outliers. Any classifier, such as kNN, SVM, perceptron, Naive Bayes or Random Forest, can be used; however, supervised learning is rarely used for anomaly detection because the ratio of outliers to normal points is quite low, so the training data for the outlier class is usually insufficient, leading to poor performance. Instead, semi-supervised learning is preferred. Semi-supervised learning is analogous to one-class supervised learning, i.e. a model representing normal behavior is constructed from a training dataset, and points in the test dataset which deviate from this normal model are labeled as anomalies. Popular examples of semi-supervised algorithms are One-Class SVMs (Schölkopf et al., 2000) and Replicator Neural Networks (Hawkins et al., 2002).

Most anomaly detection algorithms fall in the category of unsupervised learning. These can further be classified as distance-based and density-based. In the distance-based approach, an object O in the dataset D is declared an outlier if at least a fraction p of the objects in D are farther than distance d from O (Knox and Ng, 1998). Angiulli and Pizzuti (Angiulli and Pizzuti, 2002; Angiulli and Pizzuti, 2005; Angiulli et al., 2006) calculated the sum of distances from the point under consideration to the rest of the points in the dataset, identified the top n points with the highest cumulative sum, and marked them as outliers. Top-n kNN (Ramaswamy et al., 2000) is a simple approach for outlier detection, where a distance measure like Euclidean distance is used to identify the k (user-defined) nearest neighbors of each object, and the n objects having the highest cumulative/average distance to their k neighbors are declared outliers. Distance-based clustering algorithms like k-means and PAM (partitioning around medoids) have also been used for outlier detection: with k-means, k clusters are first generated, and the n points with the largest distance from their cluster centers are reported as the n outliers.

DBSCAN (density-based spatial clustering of applications with noise) (Ester et al., 1996) is one of the earliest algorithms used to identify outliers, although its main objective is clustering.
In DBSCAN, two parameters, eps and minPts, are set; then a random point in the dataset is picked, and all the points in its neighborhood that are less than eps away are considered to form a cluster. Next, the number of such points is counted; if this count is more than minPts, the cluster is called a dense cluster. The cluster is expanded by repeating the process for all the new points in the cluster. If no new points remain but there are still unvisited points in the dataset, another random point is picked and the process is restarted, until all points have been considered. A point which has fewer than minPts in its cluster and is also not part of any other cluster is declared an outlier.

LOF (Breunig et al., 2000) is a non-clustering, density-based outlier detection algorithm derived from DBSCAN, in which an LOF (local outlier factor) score is calculated for each data point. The number of neighborhood points (n) to consider is set a priori; a reachability distance/density is computed for the data point p under consideration with respect to its n nearest neighbors, and the reachability density of each of these n neighbors is also calculated. The LOF score of the data point is the ratio of the average density of its n nearest neighbors to the density of the point itself. For a normal data point, the density of the point is similar to that of its neighbors and the LOF value is low, whereas for an outlier the LOF score is high. LOCI (Local Correlation Integral) (Papadimitriou et al., 2003) addresses the difficulty of selecting the value of n in LOF by using a different criterion for the neighborhood: all points within a radius r are considered neighbors, and the reachability density is calculated with respect to all of them. In LOCI, an MDEF (Multi-granularity Deviation Factor) value is calculated for each point; the MDEF at radius r is the relative deviation of the point's neighborhood density from the average neighborhood density in its r-neighborhood. Different values of r are taken, and if the MDEF deviates by more than three times the standard deviation of the MDEFs in the neighborhood, the point is flagged as an outlier.
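For readers who want to experiment with the two density-based detectors just described, the following sketch (our illustration, not code from any of the cited works) uses the scikit-learn implementations of DBSCAN and LOF on a small synthetic dataset; the eps, min_samples and n_neighbors values are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

# Toy data: one dense Gaussian cluster plus a few widely scattered points.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),   # normal points
               rng.uniform(-6, 6, size=(5, 2))])    # likely outliers

# DBSCAN: points that belong to no dense cluster receive label -1 (noise/outliers).
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
dbscan_outliers = np.where(db.labels_ == -1)[0]

# LOF: compares each point's local density with that of its neighbors;
# fit_predict returns -1 for points whose local outlier factor is anomalously high.
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)
lof_outliers = np.where(lof_labels == -1)[0]
# negative_outlier_factor_ holds -LOF, so lower values indicate stronger outliers.
lof_scores = -lof.negative_outlier_factor_

print("DBSCAN flagged:", dbscan_outliers)
print("LOF flagged:", lof_outliers)
```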
Various strategies for anomaly detection in very large and high-dimensional datasets have also been discussed in the literature. Koufakou et al. (2008) implemented AVF (Attribute Value Frequency) in Map-Reduce for fast outlier detection in large categorical datasets. Liu et al. (2013) proposed an entropy-based fast detection algorithm for outliers in large datasets. Yan et al. (2015) proposed a distributed outlier detection algorithm which employs compressive sensing for sampling high-dimensional data. DROUT (Dimensionality Reduction for OUTlier detection) can be used for anomaly detection in "very wide" datasets, i.e. datasets with a large number of features. Distributed Solving Sets (DSS) and Lazy Distributed Solving Sets (LDSS) (Angiulli et al., 2010) are distributed strategies for anomaly detection in large datasets; for shorter turnaround time, GPU versions of parallel algorithms like DSS have also been implemented. The RAD (Rapid Anomaly Detection) algorithm is used by Netflix for outlier detection; it handles high-cardinality dimensions, non-normalized datasets and seasonality, and minimizes false positives. RAD is based on Robust Principal Component Analysis (RPCA), which repeatedly calculates the SVD (Singular Value Decomposition) and applies thresholds to the singular values in each iteration. Recently, Twitter released an R package for anomaly detection in time-series data, which implements the S-H-ESD (Seasonal Hybrid Extreme Studentized Deviate) test.

2.2. Replicator neural networks

RNN (Hecht-Nielsen, 1995), shown in Fig. 1, is a feed-forward neural network which reproduces the input data pattern at the output layer and aims to minimize the reconstruction error during training by adjusting the weights associated with the inputs of each neuron. The input and output layers both have N neurons corresponding to the N features of the training data, whereas the number of neurons in the middle layers is much smaller (as the primary aim of RNN is compression).


Fig. 1. Replicator Neural Network.

The original RNN proposed by Hecht-Nielsen has three hidden layers between the input and output layers, with tanh activation functions in the hidden-layer neurons and a linear/sigmoid function in the output layer; for the middle hidden layer, the activation function is staircase-like (taking k = 3 in the tanh activation function). This step-wise activation function divides the continuously distributed data points into discrete levels to achieve compression. Originally, RNN was developed for data compression, and Hawkins et al. (2002) adopted the same architecture for detecting anomalies. RNN is a semi-supervised learning technique, so a training phase as well as a testing phase is needed, although the training data here is unlabeled. For anomaly detection, the same dataset is used for training and testing. In the training phase, the dataset is used to set the input weights of each neuron, whereas in the testing phase, the outlier factor (OF) of each data point is calculated by the RNN according to Eq. (1).

OF_i = \frac{1}{n} \sum_{j=1}^{n} \left( x_{ij} - t_{ij} \right)^2 \qquad (1)

OF_i is the average reconstruction error over all n features. Points with a high OF value are marked as outliers.

2.3. Extreme learning machine

ELM (Huang et al., 2004), shown in Fig. 2, is basically a single-hidden-layer feed-forward neural network (SLFN). The difference between an SLFN and ELM lies in how the weights of the hidden-layer and output-layer neurons are updated. In an SLFN, the weights of both layers are initialized randomly and then updated by the back-propagation algorithm. In ELM, the weights of the hidden layer are assigned randomly and never updated; only the weights of the output layer are learned during training. Since only one layer of weights is updated, as opposed to both layers in an SLFN, ELM is faster. Let the training dataset be (x_j, t_j), where x_j = [x_{j1}, x_{j2}, ..., x_{jN}]^T are the input vectors and t_j is the output vector. The output of the ith hidden-layer neuron is given by g(w_i, b_i, x_j), where w_i is the weight vector connecting the input neurons to the ith hidden-layer neuron, b_i is the bias of the ith hidden neuron, and g is the activation function. Each hidden-layer neuron of ELM is also connected to each output

layer neuron with some associated weights; we denote the weights connecting the ith hidden-layer neuron to the output neurons by \beta_i. This architecture can be represented mathematically as

\sum_{i=1}^{L} \beta_i \, g(w_i, b_i, x_j) = t_j \qquad (2)

where L denotes the number of hidden neurons, and j indexes the input/output instances of the N training instances in total. The above equation can be written more compactly as

H\beta = T \qquad (3)

where, considering m output nodes, we have

\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} \qquad (4)

H is the output matrix of the hidden layer and can be represented as:

H = \begin{bmatrix} g(w_1, b_1, x_1) & \cdots & g(w_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ g(w_1, b_1, x_N) & \cdots & g(w_L, b_L, x_N) \end{bmatrix} \qquad (5)

The smallest-norm least-squares solution of (3) is

\hat{\beta} = H^{+} T \qquad (6)

where H^{+} is the Moore-Penrose generalized inverse of the matrix H. H^{+} can be calculated through the QR method, singular value decomposition (SVD), the orthogonalization method, or the orthogonal projection method. If we need to regularize the system (to prevent over-fitting), the optimization problem becomes

\min_{\beta} \; \left( \|\beta\|^2 + C \sum_{i=1}^{N} \|\xi_i\|^2 \right) \qquad (7)

where \xi_i = t_i^T - h(x_i)\beta is the training error of the ith sample and C is the corresponding penalty factor. We can transform this problem into its dual form and construct the following Lagrangian function:


Fig. 2. Extreme Learning Machine (w is fixed, \beta is trained).

F = \|\beta\|^2 + C \sum_{i=1}^{N} \|\xi_i\|^2 - \sum_{i=1}^{N} \sum_{j=1}^{m} \alpha_{ij} \left( h(x_i)\beta_j - t_{ij} + \xi_{ij} \right) \qquad (8)

Taking the partial derivatives of the above equation and applying the KKT conditions, we obtain the following. If L < N, the size of the matrix H^T H is smaller than that of the matrix HH^T, and

\beta = H^{+} T = \left( \frac{I}{C} + H^T H \right)^{-1} H^T T \qquad (9)

So, the final output of ELM is

f(x) = h(x)\beta = h(x) \left( \frac{I}{C} + H^T H \right)^{-1} H^T T \qquad (10)

whereas if L > N, the size of the matrix HH^T is smaller than that of the matrix H^T H, and the solution becomes

\beta = H^{+} T = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \qquad (11)

So, the final output of ELM is

f(x) = h(x)\beta = h(x) H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \qquad (12)

For the binary classification problem, the decision function of ELM is formulated as:

f(x) = \mathrm{sign}\left( h(x)\beta \right) \qquad (13)

For the multi-class case, the class label of a sample is given by:

\mathrm{label}(x) = \arg\max_{1 \le i \le m} f_i(x) \qquad (14)

and

f(x) = \left[ f_1(x), f_2(x), f_3(x), \ldots, f_m(x) \right]^T \qquad (15)

ELM has been used for prediction and classification tasks in different domains like medical (Zhang et al., 2018), civil engineering (Yaseen et al., 2018), hydrology research (Rezaie-Balf and Kisi, 2018), modeling of chemical process (Zhu et al., 2018), etc.
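To make the formulation above concrete, the following minimal NumPy sketch (our illustration, not code from Huang et al.) trains a regularized ELM along the lines of Eqs. (5), (9) and (11), assuming a sigmoid activation and uniformly random hidden weights; all function and variable names are ours.

```python
import numpy as np

def elm_fit(X, T, L=50, C=1.0, seed=0):
    """Train a regularized ELM: random hidden layer, closed-form output weights.

    X: (N, d) inputs; T: (N, m) targets (e.g. one-hot labels);
    L: number of hidden neurons; C: penalty factor from Eq. (7)."""
    rng = np.random.RandomState(seed)
    N, d = X.shape
    W = rng.uniform(-1, 1, size=(d, L))     # input weights w_i (never updated)
    b = rng.uniform(-1, 1, size=L)          # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # hidden layer output matrix, Eq. (5)

    if L <= N:   # Eq. (9): beta = (I/C + H^T H)^-1 H^T T
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    else:        # Eq. (11): beta = H^T (I/C + H H^T)^-1 T
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Eq. (10)/(12): f(x) = h(x) beta; Eq. (14): label = argmax over outputs."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)

# Tiny usage example with random data and 3 classes.
X = np.random.randn(200, 10)
y = np.random.randint(0, 3, size=200)
T = np.eye(3)[y]                            # one-hot targets
W, b, beta = elm_fit(X, T, L=40, C=10.0)
print("training accuracy:", (elm_predict(X, W, b, beta) == y).mean())
```

Because the hidden weights stay fixed after initialization, training reduces to a single linear solve, which is the source of ELM's speed advantage over back-propagation described above.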

3. Garson-pruned ELM-RNN

The original RNN has three hidden layers and uses the back-propagation algorithm to update the weights of the neurons. Although this gives good results for anomaly detection, in terms of both speed and accuracy, compared to other anomaly detection algorithms, its performance on large datasets is not satisfactory. The primary reason RNN is not suitable for large datasets is that back-propagation updates multiple sets of weights over multiple iterations. If back-propagation is replaced by ELM as the learning mechanism, the process speeds up considerably, as the weights of only one layer have to be set, and this is done in a single computation pass over the dataset. Hence, ELM-RNN is considerably faster than the traditional back-propagation-based RNN.

An ELM-based RNN also solves the problem of determining the number of hidden layers, as ELM has only a single hidden layer. However, the problem of determining the number of hidden-layer neurons still persists; this can be solved by Garson's algorithm (Garson, 1991; Goh, 1995) for pruning hidden nodes. Garson's method finds the influence of each hidden-layer neuron on the output, and the neurons whose influence is less than a threshold value are pruned. The method works as follows. Let there be an SLFN with a given number of input neurons and h hidden-layer neurons. The output vector is multiplied with the input matrix of the hidden-layer neurons, to give

c_{ij} = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_h \cdot x_1 + b_h) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_h \cdot x_N + b_h) \end{bmatrix} \begin{bmatrix} t_{11} \\ \vdots \\ t_{N1} \end{bmatrix} \qquad (16)

The influence of each hidden layer input on the output is calculated as follows:

r_{ij} = \frac{|c_{ij}|}{\sum_{i=1}^{N} |c_{ij}|} \qquad (17)


Next, the influence of each hidden layer neuron on the output is calculated as

S_k = r_{1k} + r_{2k} + r_{3k} + \cdots + r_{Nk} \qquad (18)

Finally, the relative importance of each hidden-layer neuron is calculated by

RI(k) = \frac{S_k}{\sum_{i=1}^{h} S_i} \times 100 \qquad (19)

We then determine a threshold for pruning and remove the hidden-layer neurons whose relative importance falls below it. This threshold value can be determined through the harmonic mean,

\mathrm{Threshold} = \frac{h}{\sum_{k=1}^{h} \frac{1}{RI(k)}} \qquad (20)

After pruning the less influential hidden nodes, we obtain a new hidden layer matrix, which is then used in the ELM learning algorithm. Taking the output vector to be the same as the input vector, and calculating the reconstruction error as in Eq. (1), we obtain the Replicator Neural Network model. The GP-ELM-RNN training algorithm can be summarized as follows (a code sketch is given below):

Given a training set S = {x_i ∈ R^N}, N input neurons, N output neurons and L hidden neurons.
Step-1: Randomly generate the input weights w_i and hidden-neuron biases b_i, i = 1, 2, ..., L.
Step-2: Take each training instance x_j as the input vector as well as the output vector.
Step-3: Calculate the hidden layer output matrix H for the training samples.
Step-4: Apply Garson's algorithm on H to get H_new.
Step-5: Calculate the output layer weights β from Eq. (9) or Eq. (11) using H_new.
Step-6: After all the training samples have been utilized and β has been calculated, take the whole training dataset as the test dataset and calculate the root-mean-square error over all features for each training instance (called the reconstruction error).
Step-7: Rank the samples by reconstruction error, from higher to lower.
Step-8: Mark the k samples having the highest reconstruction error as outliers.

While dealing with large datasets, we need parallel algorithms to reduce the computation time. The proposed GP-ELM-RNN, if implemented on a parallel computing platform, would be able to detect anomalies in large datasets in real time. Although there are many big data platforms for parallel computing, like Hadoop, Spark and H2O (machine learning), a GPU-based solution clearly outperforms any CPU-based solution. A mid-range desktop PC provides 4-8 CPU cores, whereas a mid-range graphics card provides a GPU with hundreds of cores. Hence, on desktops, a GPU can provide parallelization to a much larger extent than a CPU, and the processing speed of a GPU is also higher. So, if a task can be performed in parallel, GPUs are a better choice than CPUs. To perform computations on a GPU, CUDA seems to be the first choice, but recently new high-level tools have emerged which make this task much easier. Some of the popular deep learning tools are TensorFlow, Keras, and Theano. The proposed GP-ELM-RNN, if implemented on a tool like TensorFlow, can give results for large datasets more quickly than any other available solution. TensorFlow provides APIs to quickly develop learning algorithms, and the applications developed provide parallelism implicitly, without any need to hard-code it explicitly in the application. Hence, the proposed algorithm is implemented in TensorFlow to get optimal execution speed.
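A minimal NumPy sketch of Steps 1-8 is given below; it is our illustration rather than the authors' TensorFlow implementation. The sigmoid activation, the use of the first feature as the attribution target when forming the Garson matrix of Eq. (16), and the helper names are assumptions made for the sketch; the pruning threshold is the harmonic mean of Eq. (20).

```python
import numpy as np

def garson_importance(H, t):
    """Relative importance RI(k) of each hidden neuron (Eqs. (16)-(19)).

    H: (N, h) hidden-layer outputs; t: (N,) output values used for attribution
    (here the first replicated feature, an assumption made for this sketch)."""
    c = H * t[:, None]                        # Eq. (16): weight activations by the output
    r = np.abs(c) / np.abs(c).sum(axis=0)     # Eq. (17): influence, normalized per neuron
    S = r.sum(axis=0)                         # Eq. (18): total influence of each neuron
    return S / S.sum() * 100.0                # Eq. (19): relative importance in percent

def gp_elm_rnn(X, C=1.0, k=10, seed=0):
    """Garson-pruned ELM replicator network; returns indices of the k likely outliers."""
    rng = np.random.RandomState(seed)
    N, d = X.shape
    L = max(d - 1, 1)                         # initial hidden size, as in Section 4
    W = rng.uniform(-1, 1, size=(d, L))       # Step 1: random input weights
    b = rng.uniform(-1, 1, size=L)            #         and hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # Step 3: hidden output matrix

    RI = garson_importance(H, X[:, 0])        # Step 4: Garson importance of each neuron
    threshold = len(RI) / np.sum(1.0 / RI)    # Eq. (20): harmonic-mean threshold
    keep = RI >= threshold
    H_new, L_new = H[:, keep], int(keep.sum())

    # Step 5: output weights via Eq. (9)/(11); targets equal the inputs (replicator).
    if L_new <= N:
        beta = np.linalg.solve(np.eye(L_new) / C + H_new.T @ H_new, H_new.T @ X)
    else:
        beta = H_new.T @ np.linalg.solve(np.eye(N) / C + H_new @ H_new.T, X)

    # Steps 6-8: reconstruction error per instance (Eq. (1)), rank, take the top k.
    OF = ((X - H_new @ beta) ** 2).mean(axis=1)
    return np.argsort(OF)[::-1][:k], OF

# Usage: mostly Gaussian data with a few injected anomalies.
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(500, 8)), rng.uniform(-8, 8, size=(5, 8))])
outlier_idx, scores = gp_elm_rnn(X, C=10.0, k=5)
print("flagged instances:", sorted(outlier_idx.tolist()))
```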

4. Experiments and results

In this section, the performance of the proposed GP-ELM-RNN algorithm is compared with standard anomaly detection algorithms: DBSCAN, LOF, k-means (modified) and One-Class SVM. K-Means, DBSCAN and LOF are unsupervised learning algorithms, whereas One-Class SVM is a semi-supervised learning algorithm. The experiments were conducted on openly available UCI/ODDS datasets: lymphography, Wisconsin breast cancer, post-operative, page-blocks, credit card fraud detection, forest cover, and KDD-Cup99. The datasets are summarized in Table 1, indicating the size of each dataset and the count of outliers it contains. The experiments were conducted in two phases. First, the accuracy of the algorithms was evaluated; the metrics used are Jaccard similarity (between the actual and predicted labels over all instances in the dataset) and specificity (percentage of correctly identified anomalies). Second, the execution speed of the proposed solution (algorithm and associated platform) is compared with the currently fastest solutions for anomaly detection: AVF (Attribute Value Frequency) on Apache Spark and the Replicator Neural Network (vanilla version) on the H2O platform.
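For reference, the short sketch below (our addition, not the authors' evaluation script) shows one way to compute the two accuracy metrics from actual and predicted outlier labels with scikit-learn; the averaging convention for the Jaccard score is an assumption, since the paper does not state it.

```python
import numpy as np
from sklearn.metrics import jaccard_score, confusion_matrix

# 1 marks an outlier, 0 a normal point (toy labels purely for illustration).
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0])

# Jaccard similarity between the actual and predicted label vectors,
# micro-averaged over both classes (an assumed convention).
jaccard = jaccard_score(y_true, y_pred, average="micro")

# "Specificity" as used in the text: fraction of true anomalies correctly identified,
# i.e. recall on the outlier class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tp / (tp + fn)

print(f"Jaccard similarity: {jaccard:.4f}, specificity: {specificity:.4f}")
```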

Table 1. Datasets.

Dataset | Dimensions | Outliers ratio | Description
Lymphography | 148 × 18 | 6/142 | Originally 4 classes: normal find (label 1), metastasis (label 2), malign lymph (label 3), and fibrosis (label 4). Classes "normal find" and "fibrosis" have only 2 and 4 instances, hence treated as outliers.
Post-Operative | 90 × 9 | 26/64 | The data points are classified into 3 classes: A (64 points), S (24 points) and I (2 points). The points labeled S and I are treated as outliers.
Wisconsin | 483 × 10 | 39/444 | Following Hawkins et al. (2002), only one of every sixth malignant instance is kept, giving 39 malignant instances, which are treated as outliers among 444 benign instances.
Page-blocks | 5053 × 11 | 140/4913 | The text blocks (4913) are considered inliers and the non-text blocks (560) are considered outliers. Only one of every four non-text instances is kept, giving 140 outliers.
Credit Card | 284807 × 11 | 492/284315 | These transactions are done with 172,792 credit cards, hence there are multiple transactions for most of the cards.
Forest Cover | 286048 × 10 | 2747/283301 | The original Forest Cover dataset has 581,012 instances (with 54 attributes) classified into 7 classes (1–7). Only instances of class 2 (inliers) and class 4 (outliers) are considered, and the soil type and wilderness attributes are ignored.
KDD-Cup99 | 567497 × 3 | 2211/567497 | KDD-Cup99 is an intrusion-detection dataset. Every type of attack is labeled as an outlier.


Table 2. Accuracy of the algorithms (Jaccard similarity).

Dataset | K-Means | DBSCAN | LOF | SVM | GP-ELM-RNN
Lymphography | 0.9189 | 0.0405 | 0.9121 | 0.7094 | 0.9864
Post-Operative | 0.5632 | 0.2888 | 0.6896 | 0.6896 | 0.9111
Cancer | 0.8674 | 0.0807 | 0.8260 | 0.9213 | 0.9792
Page-blocks | 0.8724 | 0.0425 | 0.8753 | 0.8984 | 0.9934
Credit Card | 0.9966 | 0.0017 | 0.8987 | 0.9011 | 0.9999
Forest Cover | 0.9886 | 0.8210 | 0.8932 | 0.9077 | 0.9982
KDD-Cup99 | 0.9952 | 0.9929 | 0.8963 | 0.9038 | 0.9999

Table 3. Accuracy of the algorithms (specificity).

Dataset | K-Means | DBSCAN | LOF | SVM | GP-ELM-RNN
Lymphography | 0.0000 | 0.0405 | 0.6666 | 0.3333 | 0.8333
Post-Operative | 0.2692 | 0.2888 | 0.1538 | 0.2307 | 0.8461
Cancer | 0.1794 | 0.0807 | 0.0512 | 0.6410 | 0.8717
Page-blocks | 0.0017 | 0.3982 | 0.3803 | 0.4910 | 0.8857
Credit Card | 0.0162 | 0.0017 | 0.1504 | 0.9044 | 0.9857
Forest Cover | 0.0000 | 0.4746 | 0.1507 | 0.9049 | 0.9100
KDD-Cup99 | 0.0000 | 0.0954 | 0.0312 | 0.3893 | 0.5427

Table 4. Execution time (in seconds).

Dataset | Dimensions | AVF (Spark) | RNN (H2O) | GP-ELM-RNN (TensorFlow)
Forest Cover | 286,048 × 10 | 32 ± 1 | 182 ± 5 | 15 ± 1
Synthetic-1 | 300,000 × 30 | 93 ± 1 | 235 ± 5 | 18 ± 1
Synthetic-2 | 500,000 × 10 | 42 ± 1 | 450 ± 5 | 23 ± 1
Synthetic-3 | 1,000,000 × 10 | 96 ± 1 | 980 ± 5 | 29 ± 1
KDD-Cup99 | 567,497 × 41 | 404 ± 1 | 306 ± 5 | 54 ± 1
Synthetic-4 | 1,000,000 × 40 | 910 ± 5 | 640 ± 5 | 87 ± 1

The algorithms were run with the default parameters of the corresponding library (scikit-learn). For the k-means algorithm, the number of clusters was set to 8. DBSCAN does not need the number of clusters, but requires the parameters eps (the maximum distance between two samples for them to be considered in the same neighborhood) and min_samples (the number of instances in the neighborhood, including the point itself, for a point to be considered a core point); eps was set to 0.5 and min_samples to 5. LOF requires the number of neighbors to consider, which was set to 20. For One-Class SVM, an "rbf" kernel of degree 3 was used. For GP-ELM-RNN, the initial number of hidden-layer nodes was set to one less than the number of features, and the algorithm then prunes and determines the optimal number of nodes itself.

The accuracy achieved by each algorithm on each dataset is presented in Tables 2 and 3. The accuracy of GP-ELM-RNN is much higher than that of K-Means, DBSCAN, LOF and SVM. The major reason for the poorer performance of these algorithms compared to ELM-RNN is that they require a few parameters to be set which significantly affect their performance, and since they were run with default parameters, their results suffered. The proposed solution, in contrast, is self-optimizing and does not have this drawback. Regarding speed, K-Means, DBSCAN and LOF are all distance-based algorithms with O(n²) complexity, which is problematic for large datasets. The time complexity of One-Class SVM is comparable to that of RNN, but ELM-RNN is expected to be faster than One-Class SVM.

From the results shown in Tables 2 and 3, it is clear that GP-ELM-RNN outperforms all the state-of-the-art algorithms in terms

of accuracy. However, when handling large datasets, some accuracy can be traded for speed; hence the speed of the algorithm is also a significant factor. To compare the execution speed of the proposed solution with the fastest alternatives available for anomaly detection on big data, the AVF algorithm was implemented on Apache Spark, and the autoencoder (RNN) available in the H2O machine learning library was used. The algorithms were run incrementally on a single machine and the execution time for various datasets was measured. The RNN implemented on H2O had three hidden layers, each with 50 neurons, and was trained for 100 epochs. The experiments were performed on a low-end NVIDIA GeForce GT 730M graphics card. From the results summarized in Table 4, it is observed that for datasets with lower dimensionality, AVF is faster than RNN, whereas RNN outperforms AVF for datasets with high dimensionality. The proposed solution (GP-ELM-RNN on GPU), however, outperforms both algorithms on every type of dataset.

5. Conclusion and future scope

Anomaly detection becomes more and more challenging as the size of the dataset increases, because the ratio of anomalous to normal data points becomes smaller and smaller, and hence the probability of correctly identifying an anomaly decreases. Popular distance-based algorithms like LOF clearly fail when dealing with large datasets. An algorithm for anomaly detection in large datasets should be fast, accurate and scalable. This work aimed at an optimized anomaly detection algorithm which can handle large datasets; hence the Garson-pruned ELM-RNN algorithm was proposed. The algorithm was implemented in TensorFlow so that it can run on CPU as well as GPU. From the experiments it is evident that Garson-pruned ELM-RNN on TensorFlow is the best approach for anomaly detection in terms of both accuracy and speed. As future directions, pruning methods other than Garson's algorithm can be explored, and different neural network models for anomaly detection can also be proposed. Techniques other than neural networks can likewise be explored, but iterative optimization algorithms should be avoided when dealing with large datasets.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Chandola, V., Banerjee, A., Kumar, V., 2009. Anomaly detection: a survey. ACM Computing Surveys (CSUR).
Hecht-Nielsen, R., 1995. Replicator neural networks for universal optimal source coding. Science 269, 1860–1863.
Hawkins, S., He, H., Williams, G., Baxter, R., 2002. Outlier detection using replicator neural networks. In: International Conference on Data Warehousing and Knowledge Discovery, pp. 170–180.
Huang, G.B., Zhu, Q.Y., Siew, C.K., 2004. Extreme learning machine: a new learning scheme of feedforward neural networks. In: IEEE International Joint Conference on Neural Networks, pp. 985–990.
Bryson, R., Kenwright, S., Cox, D., Ellsworth, M., Haimes, D., 1999. Visually exploring gigabyte data sets in real time. Commun. ACM.
Tukey, J.W., 1977. Exploratory Data Analysis.
Koufakou, A., Secretan, J., Reeder, J., Cardona, K., Georgiopoulos, M., 2008. Fast parallel outlier detection for categorical datasets using MapReduce. In: Proceedings of IEEE International Joint Conference on Neural Networks, pp. 3298–3304.
Liu, B., Fan, W., Xiao, T., 2013. A fast outlier detection method for big data. In: Proceedings of Asian Simulation Conference, pp. 379–384.
Schölkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.C., 2000. Support vector method for novelty detection. In: Advances in Neural Information Processing Systems, pp. 582–588.
Knox, E.M., Ng, R.T., 1998. Algorithms for mining distance based outliers in large datasets. In: Proceedings of International Conference on Very Large Data Bases, pp. 392–403.
Angiulli, F., Pizzuti, C., 2002. Fast outlier detection in high dimensional spaces. In: Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery, pp. 15–27.
Angiulli, F., Pizzuti, C., 2005. Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng. 17, 203–215.
Angiulli, F., Basta, S., Pizzuti, C., 2006. Distance-based detection and prediction of outliers. IEEE Trans. Knowl. Data Eng. 18, 145–160.
Ramaswamy, S., Rastogi, R., Shim, K., 2000. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record 29, 427–438.
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp. 226–231.
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J., 2000. LOF: identifying density-based local outliers. SIGMOD Rec. 29 (2), 93–104. https://doi.org/10.1145/335191.335388.
Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C., 2003. LOCI: fast outlier detection using the local correlation integral. In: 19th International Conference on Data Engineering, pp. 315–326.
Yan, Y., et al., 2015. Distributed outlier detection using compressive sensing. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 3–16.
Angiulli, F., Basta, S., Lodi, S., Sartori, C., 2010. A distributed approach to detect outliers in very large datasets. In: Euro-Par 2010 Parallel Processing, pp. 329–340.
Zhang, Y., Wang, Y., Zhou, G., Jin, J., Wang, B., Wang, X., Cichocki, A., 2018. Multi-kernel extreme learning machine for EEG classification in brain-computer interfaces. Expert Syst. Appl. 96, 302–310.
Yaseen, Z.M., Deo, R.C., Hilal, A., Abd, A.M., Bueno, L.C., Salcedo-Sanz, S., Nehdi, M.L., 2018. Predicting compressive strength of lightweight foamed concrete using extreme learning machine model. Adv. Eng. Softw. 115, 112–125.
Rezaie-Balf, M., Kisi, O., 2018. New formulation for forecasting streamflow: evolutionary polynomial regression vs. extreme learning machine. Hydrol. Res. 49, 939–953.
Zhu, Q.X., Wang, X., He, Y.L., Xu, Y., 2018. An improved extreme learning machine integrated with nonlinear principal components and its application to modeling complex chemical processes. Appl. Therm. Eng. 130, 745–753.
Garson, G.D., 1991. Interpreting neural-network connection weights. AI Expert, pp. 46–51.
Goh, A., 1995. Back-propagation neural networks for modeling complex systems. Artif. Intell. Eng., 143–151.