Optimizing echo state network with backtracking search optimization algorithm for time series forecasting


Engineering Applications of Artificial Intelligence 81 (2019) 117–132


Engineering Applications of Artificial Intelligence journal homepage: www.elsevier.com/locate/engappai

Optimizing echo state network with backtracking search optimization algorithm for time series forecasting✩

Zhigang Wang a,b, Yu-Rong Zeng c, Sirui Wang a, Lin Wang a,∗

a School of Management, Huazhong University of Science and Technology, Wuhan 430074, China
b YGsoft Inc., Yuanguang Software Park, Technology and Innovation Coast, Zhuhai 519085, China
c Hubei University of Economics, Wuhan 430205, China

ARTICLE INFO

Keywords: Echo state network; Evolutionary algorithm; Backtracking search optimization algorithm; Time series forecasting

ABSTRACT

The echo state network (ESN) is a state-of-the-art reservoir computing approach that is particularly effective for time series forecasting problems because it is coupled with a time parameter. However, the linear regression algorithm commonly used to compute the output weights of the ESN often causes the trained network to overfit and thus yields unsatisfactory results. To overcome this problem, we present four optimized ESNs that are based on the backtracking search optimization algorithm (BSA) or its variants to improve generalizability. Concretely, we utilize BSA and its variants to determine the most appropriate output weights of the ESN, given that the optimization problem is complex and BSA is a novel evolutionary algorithm that effectively searches for optimal solutions in complex spaces. The three BSA variants, namely, adaptive population selection scheme (APSS)–BSA, adaptive mutation factor strategy (AMFS)–BSA, and APSS&AMFS–BSA, were designed to further improve the performance of BSA. Time series forecasting experiments were performed using two real-life time series. The results of the optimized ESNs were compared with those of the basic ESN without optimization, two other comparison approaches, and other existing approaches. Experimental results showed that (a) the optimized ESNs are more accurate than the basic ESN and (b) APSS&AMFS–BSA–ESN nearly always outperforms the basic ESN, the three other optimized ESNs, the two comparison approaches, and other existing optimization approaches.

1. Introduction

Time series forecasting is a dynamic research topic that has attracted the attention of many researchers over the last few decades (Wang et al., 2018a). This field develops appropriate models to predict future values from a time series of previously observed data. Hence, time series forecasting is considered the act of predicting the future by understanding the past (Peng et al., 2015). Forecasting plays an important role in decision making and has been applied in finance (Esling and Agon, 2012), electricity (Zeng et al., 2017), electromechanical systems (Liu et al., 2018a), engineering (Tak-chung, 2011), climatology (Fildes and Kourentzes, 2011; Huang et al., 2018), tourism (Li and Song, 2008), medicine (Lacy et al., 2018), prices (Peng et al., 2018), and stock markets (Rubio et al., 2017). Time series forecasting, however, is a challenging research topic because it is invariably a complex nonlinear problem. Thus, a superior model with accurate forecasting capability should be developed to enhance the practical applications of time series forecasting (Lv et al., 2018a).

Numerous modeling techniques have been developed to enhance forecasting performance; however, no universal guidelines are yet available for determining the most appropriate technique for a specific problem (Wang et al., 2018b). One major approach is to model the time series forecasting problem with artificial neural networks (ANNs). The major advantage of ANNs is their flexible modeling capability, which exhibits a self-adaptive and data-driven mechanism. ANNs are universal and powerful approximators (Hornik et al., 1989) that can capture the appropriate data generation process for linear and nonlinear time series of different forms. ANNs have become one of the most accurate and widely used forecasting models (Crone et al., 2011; Zeng et al., 2017; Zhang et al., 1998). However, the same powerful approximation capabilities that allow ANNs to model time series data substantially complicate model specification. The selection of an optimal architecture is therefore important in neural network applications (Wang et al., 2018). Recurrent neural networks (RNNs), such as Hopfield, Elman, bidirectional associative memory

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.02.009. ∗ Corresponding author. E-mail addresses: [email protected] (Z.G. Wang), [email protected] (Y.-R. Zeng), [email protected] (S.R. Wang), [email protected] (L. Wang).

https://doi.org/10.1016/j.engappai.2019.02.009 Received 24 September 2017; Received in revised form 25 December 2018; Accepted 12 February 2019 Available online xxxx 0952-1976/© 2019 Elsevier Ltd. All rights reserved.


as a feature selection method for determining the optimal connection structures for output weights in ESNs; the proposed algorithm provided better results than least angle regression, the classical feature selection method. Wang et al. (2015b) presented an adaptive differential evolution (ADE) algorithm to optimize some initial weights and thresholds of the back-propagation neural network (BPNN) to improve forecasting accuracy. Similarly, Chouikhi et al. (2017) applied a PSO algorithm to pre-train some fixed weight values of ESNs to improve learning performance. Although EAs are popular stochastic search algorithms that are widely used to solve complex optimization problems, they have some flaws, such as excessive sensitivity to control parameters, premature convergence, and slow computation. In this situation, the backtracking search optimization algorithm (BSA), proposed by Civicioglu (2013), was motivated by studies that attempt to develop simpler and more effective search algorithms. The main advantages of BSA over other EAs are threefold. First, BSA has a single control parameter, and its problem-solving performance is not overly sensitive to the initial value of this parameter. Second, BSA has a simple structure that is effective, fast, and capable of solving multimodal problems and that enables it to adapt easily to different numerical optimization problems. Finally, BSA's strategies for generating trial populations and controlling the amplitude of the search-direction matrix and search-space boundaries give it very powerful exploration and exploitation capabilities. In sum, compared with other EAs, BSA is more efficient, less computationally costly, and easier to implement; it requires fewer parameters and converges more quickly on complex problems.
BSA has attracted considerable research attention since its introduction and has been successfully used in various domains, such as numerical optimization (Civicioglu, 2013; Wang et al., 2016a), automatic generation control (Madasu et al., 2017), community detection (Zou et al., 2017), constrained optimization problems (Zhang et al., 2015), and power flow (Chaib et al., 2016). Thus, BSA could be an effective and efficient method for optimizing the output weights of ESNs. In this study, ESN is selected as the basic forecasting model, and BSA is used to optimize the output weights of the ESN. The flowchart of this study is shown in Fig. 1. Although BSA has been successfully used in various domains, Wang et al. (2016a) stated, ''BSA is better in exploration but worse in exploitation, and how to balance the two capabilities is still a challenging task''. Thus, to address the imbalance between exploration and exploitation, we propose three BSA variants: adaptive population selection scheme (APSS)–BSA, adaptive mutation factor strategy (AMFS)–BSA, and APSS&AMFS–BSA. These variants apply three different adaptive strategies within the main steps of BSA and are likely to improve the performance of BSA in a controlled manner, and thus the performance of the optimized ESNs. Based on these improvements of ESN and BSA for better forecasting performance, the current study provides the following contributions. First, we utilize BSA and its variants, rather than the traditional linear regression algorithm, to optimize the output weights of ESN and thereby overcome its overfitting problem in time series forecasting. Second, three BSA variants, namely, APSS–BSA, AMFS–BSA, and APSS&AMFS–BSA, are proposed to improve BSA performance and therefore the performance of the optimized ESNs.
Finally, some ideas are presented for researchers who intend to explore ESNs, BSA, and related problems. The rest of this paper is organized as follows. A brief review of the standard ESN is provided in Section 2. An explanation of standard BSA and an introduction to its variants, namely, APSS–BSA, AMFS–BSA, and APSS&AMFS–BSA, are given in Section 3. The five proposed forecasting models, namely, ESN, BSA–ESN, APSS–BSA–ESN, AMFS–BSA–ESN, and APSS&AMFS–BSA–ESN, are described in Section 4. The experiments used to test and verify the feasibility and effectiveness of the proposed approaches are discussed in Section 5. The conclusions and an outlook for future work are drawn in Section 6.

(BAM), and echo state network (ESN), provide considerable memory by creating an internal network state that allows the network to exhibit dynamic temporal behavior. Networks of this kind, which capture dynamic temporal behavior, are more suitable for simulating nonlinear dynamic systems than conventional feedforward artificial neural networks (FANNs), such as the back-propagation neural network (BPNN), in many fields. Researchers have applied RNNs in many fields, such as clustering (Cherif et al., 2011), pattern recognition (Chatzis and Demiris, 2012), classification (Lacy et al., 2018; Skowronski and Harris, 2007; Wang et al., 2016b), and prediction (Chandra and Zhang, 2012; Ewing and Thompson, 2012; Jaeger and Haas, 2004; Yang et al., 2018). The ESN, which was proposed by Jaeger (2001), is one of the most popular RNNs. The hidden layer of ESN is called the reservoir, which is randomly initialized. In general, the relevant information of the inputs can be represented through the internal states of the reservoir that are generated when inputs are fed into the ESN. Accordingly, ESN can perform nonlinear time-varying mappings of input and target patterns by utilizing the dynamics of the reservoir instead of those of the original inputs. This technique is coupled with a time parameter and thus is highly effective for time series forecasting (Andrawis et al., 2011; Bianchi et al., 2015; Chouikhi et al., 2017; Jaeger and Haas, 2004; Xu et al., 2018). ESN is one of the best-known classes of reservoir computing (RC) (Lukoševičius and Jaeger, 2009). Relative to other RNNs and FANNs, such as Hopfield, Elman, BAM, and BPNN, the most obvious advantages of ESN are its simpler network structure and lower calculation cost (Dutoit et al., 2009; Tong et al., 2007). The hidden layer in the ESN is modeled by a reservoir in which randomly and sparsely connected recurrent units interact. In the RC paradigm, only the weights from the reservoir to the output layer (readout weights) are trained.
The other parameters of the network, such as the weights from the input layer to the reservoir and the reservoir connection weights, are randomly selected and remain unaltered until the end of the learning process. The output weights are usually trained with a simple linear regression algorithm. This training technique saves calculation costs but can cause the overfitting (over-training) problems that are common in neural network modeling. Conceptually, a neural network model is overfitted (over-trained) when the network fits the in-sample training data well but produces poor out-of-sample results. The output weights of ESN may be unsuitable for testing given that they are obtained through conventional simple linear regression, which makes the network fit the training data very closely. Hence, the output weights of ESN should be optimized through other techniques to improve performance. The optimization of the output weights of ESNs concerns how the reservoir and output layer units are connected. The task is to optimize these weights and falls into the category of discrete optimization. Moreover, the dimension of the optimization task is equal to the size of the reservoir in the general case (the number of units in the output layer is set to 1). Note that the size of the reservoir (generally 100–1000 units) must be large enough to capture potential data features given that the reservoir is randomly generated. Thus, the optimization of the output weights of ESNs is a discrete, high-dimensional, and complexly nonlinear problem. Many studies have reported that optimization techniques based on evolutionary algorithms (EAs) can efficiently solve such computational problems. EAs, which are inspired by biological evolution, can be used to optimize neural networks.
Kobialka and Kayani (2010) used two state-of-the-art feature selection algorithms to exclude irrelevant internal states from directly contributing to the output units of ESN. Mirjalili et al. (2012) proposed the gravitational search algorithm (GSA) and a hybrid of particle swarm optimization (PSO) and GSA (denoted as PSOGSA) to train feed-forward neural networks in order to resolve the problems of trapping in local minima and the slow convergence rate of current evolutionary learning algorithms. Wang and Yan (2015) proposed and employed a binary PSO


Fig. 1. Flowchart for time series forecasting used in this study.

2. Echo state network

As illustrated in Fig. 2, the standard ESN can be partitioned into three components: an input layer (𝐾 units), a hidden layer or reservoir (𝑁 units), and an output layer (𝐿 units). The activations of the input units, the internal units, and the output units at time step 𝑛 are denoted as (1), (2), and (3), respectively:

𝑢(𝑛) = (𝑢1 (𝑛), 𝑢2 (𝑛), … , 𝑢𝐾 (𝑛))T , (1)
𝑥(𝑛) = (𝑥1 (𝑛), 𝑥2 (𝑛), … , 𝑥𝑁 (𝑛))T , (2)
𝑦(𝑛) = (𝑦1 (𝑛), 𝑦2 (𝑛), … , 𝑦𝐿 (𝑛))T . (3)

At time step 𝑛, the input series 𝑢(𝑛) is fed into the network through connections assembled in the input weight matrix Win with a size of 𝑁 × 𝐾. The connection weights between reservoir units are given in a weight matrix W with a size of 𝑁 × 𝑁. The connection weights between all units in the network and the output units are given in an 𝐿 × (𝐾 + 𝑁 + 𝐿) weight matrix Wout. The weight matrix Wback with a size of 𝑁 × 𝐿 represents the connection weights from the output layer to the reservoir. The elements of matrices Win, W, and Wback are randomly fixed before training and remain unchanged given the characteristics of the ESN. However, according to Jaeger (2001), the reservoir internal connection matrix W must be a sparse matrix with a spectral radius smaller than 1 to retain the ''echo state property''. Then, the activation of the internal units at time step 𝑛 + 1 is updated according to Eq. (4) as follows:

𝑥(𝑛 + 1) = 𝑓 (Win 𝑢(𝑛 + 1) + W𝑥(𝑛) + Wback 𝑦(𝑛)), (4)

where 𝑓 is the activation function of the reservoir units, and the S-shaped function 𝑡𝑎𝑛ℎ is usually adopted. The output of the network at time step 𝑛 + 1 can be calculated according to Eq. (5):

𝑦(𝑛 + 1) = 𝑓 𝑜𝑢𝑡 (Wout (𝑢(𝑛 + 1), 𝑥(𝑛 + 1), 𝑦(𝑛))), (5)

where 𝑓 𝑜𝑢𝑡 is the readout function of the output units and is usually set as the identity function. Wout is the trained output weight matrix.

Fig. 2. Architecture of the standard ESN. The solid lines represent fixed connections, and the dotted lines represent optional connections.

The supervised ESN training comprises two crucial steps. In the first step, the input signals are fed to the reservoir and then mapped onto a high-dimensional state space through nonlinear transformation. To eliminate the influence of arbitrary initial states on the system, the network states are collected only from a given time step after the input signals begin to be fed. Suppose we begin


where mutation factor 𝐹 controls the amplitude of the search-direction matrix (𝑜𝑙𝑑𝑃 − 𝑃 ).

(IV) Crossover. Generate the final form of the trial population 𝑇 through Eq. (15):

𝑇𝑖,𝑗 = { 𝑀𝑢𝑡𝑎𝑛𝑡𝑖,𝑗 , if 𝑚𝑎𝑝𝑖,𝑗 = 1; 𝑃𝑖,𝑗 , otherwise, (15)

to collect the network state with a sequence of 𝑄 desired input–output pairs {(𝑢(1), 𝑦(1)), (𝑢(2), 𝑦(2)), … , (𝑢(𝑄), 𝑦(𝑄))} being fed to the reservoir. The collected internal states are gathered in a sequence {𝑥(1), 𝑥(2), … , 𝑥(𝑄)}. The desired output is used instead in Eq. (4) because the output of the ESN is unavailable for feedback. The output weight matrix Wout is computed in the second step. The state matrix X and output vector Y are defined as follows:

X = [ 𝑢(1) ⋯ 𝑢(𝑄) ; 𝑥(1) ⋯ 𝑥(𝑄) ; (𝑓 𝑜𝑢𝑡 )−1 (𝑦(1)) ⋯ (𝑓 𝑜𝑢𝑡 )−1 (𝑦(𝑄)) ], (6)
Y = [ (𝑓 𝑜𝑢𝑡 )−1 (𝑦(1)) ⋯ (𝑓 𝑜𝑢𝑡 )−1 (𝑦(𝑄)) ]. (7)

where 𝑚𝑎𝑝 is initialized by 1 and is calculated by

𝑚𝑎𝑝𝑖,𝑢(1∶⌈𝑚𝑖𝑥𝑟𝑎𝑡𝑒⋅𝑟𝑎𝑛𝑑⋅𝐷⌉) = 0, if 𝑎 < 𝑏 | 𝑢 = 𝑝𝑒𝑟𝑚𝑢𝑡𝑖𝑛𝑔 (⟨1, 2, … , 𝐷⟩); 𝑚𝑎𝑝𝑖,𝑟𝑎𝑛𝑑𝑖(𝐷) = 0, otherwise. (16)

In Eq. (16), ⌈⋅⌉ indicates the ceiling function. 𝑟𝑎𝑛𝑑, 𝑎, and 𝑏 are three randomly generated values defined as 𝑟𝑎𝑛𝑑, 𝑎, 𝑏 ∼ 𝑈 (0, 1), where 𝑈 (0, 1) is a standard uniform distribution. 𝑟𝑎𝑛𝑑𝑖(𝐷) is a function that generates a pseudorandom integer from a uniform discrete distribution over {1, 2, … , 𝐷}. The mix rate parameter (𝑚𝑖𝑥𝑟𝑎𝑡𝑒) controls the number of elements of individuals that will mutate in an iteration through ⌈𝑚𝑖𝑥𝑟𝑎𝑡𝑒 ⋅ 𝑟𝑎𝑛𝑑 ⋅ 𝐷⌉. In addition, a boundary control mechanism checks the obtained trial population for individual elements that fall outside the given search space. Such elements are replaced as follows:

The optimal output weight matrix Wout is then obtained by solving the following regularized least squares problem:

Wout = arg min over W̃out ∈ ℝ𝐿×(𝐾+𝑁+𝐿) of ‖W̃out X − Y‖² + 𝜆‖W̃out‖², (8)

where 𝜆 ∈ ℝ+ is a positive scalar known as the regularization factor and ‖⋅‖ denotes the Euclidean norm. The solution of the problem indicated by Eq. (8) can be obtained in closed form as

Wout = ((XXᵀ + 𝜆²I)⁻¹XYᵀ)ᵀ. (9)
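As an illustration, the closed form of Eq. (9) can be transcribed in a few lines. This is a hedged sketch, not the authors' implementation: the helper name is ours, `X` stacks one collected state column per time step, and `lam` plays the role of 𝜆.

```python
import numpy as np

# Sketch of the closed-form ridge solution:
# W_out = ((X X^T + lambda^2 I)^(-1) X Y^T)^T.
def ridge_readout(X, Y, lam=1e-6):
    n = X.shape[0]
    # Solve (X X^T + lam^2 I) Z = X Y^T rather than forming the inverse.
    return np.linalg.solve(X @ X.T + lam**2 * np.eye(n), X @ Y.T).T
```

With `lam = 0` this reduces to ordinary least squares; a small positive `lam` is what counteracts the overfitting discussed in the Introduction.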

𝑇𝑖,𝑗 = 𝑟𝑎𝑛𝑑 ⋅ (𝑢𝑝𝑗 − 𝑙𝑜𝑤𝑗 ) + 𝑙𝑜𝑤𝑗 . (17)

In this paper, we assume that no connections exist between input and output units or among output units and that no feedback connections are present; none of the optional connections (dotted arrows in Fig. 2) are chosen. Moreover, the output weight matrix Wout is no longer obtained in accordance with Eq. (9) but is optimized through BSA and its variants.
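To make the ESN dynamics of Eqs. (4) and (5) concrete, they might be sketched as follows. The sizes, sparsity level, and spectral-radius value here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(42)
K, N, L = 1, 100, 1  # illustrative input, reservoir, and output sizes

W_in = rng.uniform(-0.5, 0.5, (N, K))
W = rng.uniform(-0.5, 0.5, (N, N)) * (rng.random((N, N)) < 0.05)  # sparse reservoir
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
W_back = rng.uniform(-0.5, 0.5, (N, L))

def update_state(x, u_next, y_prev):
    # Eq. (4): x(n+1) = f(W_in u(n+1) + W x(n) + W_back y(n)), with f = tanh
    return np.tanh(W_in @ u_next + W @ x + W_back @ y_prev)

def readout(W_out, u, x, y_prev):
    # Eq. (5) with the identity readout function f_out
    return W_out @ np.concatenate([u, x, y_prev])
```

Iterating `update_state` over the training inputs and stacking the states column-wise yields the state matrix X of Eq. (6).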

(V) Selection-II. Update the population 𝑃 . The individual 𝑃𝑖 in the population 𝑃 is replaced with the individual 𝑇𝑖 in the trial population 𝑇 if 𝑇𝑖 has a better fitness value than 𝑃𝑖 . If the best individual of 𝑃 (𝑃𝑔𝑏𝑒𝑠𝑡 ) has a better fitness value than the current global optimal individual 𝑃𝑏𝑒𝑠𝑡 , then 𝑃𝑏𝑒𝑠𝑡 is updated as 𝑃𝑔𝑏𝑒𝑠𝑡 .

3. BSA

3.1. Standard BSA

3.2. BSA variants

Similar to other EAs, BSA is good at dealing with complex optimization problems. Nevertheless, its performance can still be improved. On the one hand, BSA has a memory pool that randomly stores members of the previous iterations' populations, which are combined with the current population to generate the search-direction matrix. This random selection strategy cannot guarantee a high probability of selecting a good population and thus may increase the optimization time. On the other hand, the mutation factor 𝐹 is given in a non-optimal form in standard BSA; this parameter balances search capability and convergence, so the performance of BSA is sensitive to it. Therefore, the performance of BSA can be improved if the population selection strategy and the mutation factor 𝐹 are optimized. We propose three variants of BSA to enhance its performance, as follows.

3.2.1. APSS–BSA
The population selection strategy has a considerable influence on the performance of EAs, including BSA. We propose the first BSA variant, APSS–BSA, which applies APSS in lieu of the standard random selection strategy in the selection-I operator of BSA. The selection-I operator in the standard BSA determines the historical population 𝑜𝑙𝑑𝑃 to be used for calculating the search direction. The standard random selection strategy in the selection-I operator retains randomness but cannot guarantee a high probability of selecting the good population of the previous iteration as the current historical population. APSS overcomes this problem by replacing the standard random selection strategy to dynamically determine the chosen historical population. APSS is based on the roulette wheel strategy, thereby increasing the probability of selecting a good historical population. Within a limited search time, this strategy can produce more accurate exploration. Algorithm 2 shows the pseudocode of APSS.
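Algorithm 2 itself is not reproduced here, but the roulette-wheel idea behind APSS can be sketched as follows. This is a hypothetical helper of our own naming, assuming `avg_fitness[i]` holds the average fitness of the population stored from iteration `i`.

```python
import random

def roulette_select(avg_fitness):
    """Pick an index with probability proportional to its fitness value."""
    total = sum(avg_fitness)
    r = random.uniform(0.0, total)
    acc = 0.0
    for i, f in enumerate(avg_fitness):
        acc += f
        if r <= acc:
            return i
    return len(avg_fitness) - 1  # guard against floating-point round-off
```

Populations with higher average fitness occupy a larger arc of the wheel, which is what raises the probability of selecting a good historical population.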
In Algorithm 2, the fitness value of the population 𝑃 (𝑖) (𝑖 = 1, 2, … , 𝐺), where 𝐺 is the present iteration number, is calculated in three steps. First, the fitness of each individual 𝑃𝑗(𝑖) (𝑗 = 1, 2, … , 𝑝𝑜𝑝𝑆𝑖𝑧𝑒), which denotes 𝑓 𝑖𝑡𝑛𝑒𝑠𝑠𝑃 (𝑖) , is calculated. Second, the individual fitness values are

BSA is categorized as an EA because it computes optimal solutions by generating trial individuals. The trial individual from the search space with the best fitness is the solution to the current problem. The general structure of BSA is given below in Algorithm 1, in which the optimization functions are divided into five operators, as is usually done in other EAs: initialization, selection-I, mutation, crossover, and selection-II. These operators can be explained as follows:

(I) Initialization. Initialize the populations with Eqs. (10) and (11):

𝑃𝑖,𝑗 ∼ 𝑈𝑗 (𝑙𝑜𝑤𝑗 , 𝑢𝑝𝑗 ), (10)
𝑜𝑙𝑑𝑃𝑖,𝑗 ∼ 𝑈𝑗 (𝑙𝑜𝑤𝑗 , 𝑢𝑝𝑗 ), (11)

for 𝑖 = 1, 2, … , 𝑝𝑜𝑝𝑆𝑖𝑧𝑒 and 𝑗 = 1, 2, … , 𝐷, where 𝑝𝑜𝑝𝑆𝑖𝑧𝑒 and 𝐷 are the population size and the dimension of an individual, respectively; 𝑈𝑗 represents the uniform distribution from the lower bound 𝑙𝑜𝑤𝑗 to the upper bound 𝑢𝑝𝑗 for the 𝑗-th dimension; and 𝑃𝑖 and 𝑜𝑙𝑑𝑃𝑖 are the target individuals in populations 𝑃 and 𝑜𝑙𝑑𝑃 , respectively.

(II) Selection-I. Redefine the historical population 𝑜𝑙𝑑𝑃 to be used for determining the search direction at the beginning of each iteration through the ''if-then'' rule of Eq. (12):

𝑖𝑓 𝑎 < 𝑏 𝑡ℎ𝑒𝑛 𝑜𝑙𝑑𝑃 = 𝑃 | 𝑎, 𝑏 ∼ 𝑈 (0, 1), (12)

where 𝑈 (0, 1) is a standard uniform distribution. After 𝑜𝑙𝑑𝑃 is redefined, a random shuffle function is applied to randomly change the order of the individuals in 𝑜𝑙𝑑𝑃 through Eq. (13):

𝑜𝑙𝑑𝑃 = 𝑝𝑒𝑟𝑚𝑢𝑡𝑖𝑛𝑔(𝑜𝑙𝑑𝑃 ). (13)

(III) Mutation. Generate an initial form 𝑀𝑢𝑡𝑎𝑛𝑡 of the trial population 𝑇 using Eq. (14):

𝑀𝑢𝑡𝑎𝑛𝑡 = 𝑃 + 𝐹 × (𝑜𝑙𝑑𝑃 − 𝑃 ), (14)
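A hedged sketch of one full BSA generation, combining Eqs. (10)–(17): the objective below is a placeholder (lower is better in this sketch), and `F` is treated as the plain scalar mutation factor of Eq. (14).

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(P, mutant, mixrate=1.0):
    # Eqs. (15)-(16): map starts at 1; positions set to 0 keep P's genes.
    pop, D = P.shape
    m = np.ones((pop, D), dtype=bool)
    for i in range(pop):
        if rng.random() < rng.random():
            u = rng.permutation(D)
            m[i, u[:int(np.ceil(mixrate * rng.random() * D))]] = False
        else:
            m[i, rng.integers(D)] = False
    return np.where(m, mutant, P)

def bsa_generation(P, oldP, low, up, objective, F=2.0, mixrate=1.0):
    pop, D = P.shape
    if rng.random() < rng.random():        # selection-I, Eq. (12)
        oldP = P.copy()
    oldP = oldP[rng.permutation(pop)]      # shuffle, Eq. (13)
    mutant = P + F * (oldP - P)            # mutation, Eq. (14)
    T = crossover(P, mutant, mixrate)      # crossover, Eqs. (15)-(16)
    out = (T < low) | (T > up)             # boundary control, Eq. (17)
    T = np.where(out, rng.random((pop, D)) * (up - low) + low, T)
    better = objective(T) < objective(P)   # selection-II (minimization here)
    P = np.where(better[:, None], T, P)
    return P, oldP
```

Iterating `bsa_generation` while tracking the best individual reproduces the loop described later in Section 4.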



accumulated to obtain the total fitness 𝑓 𝑖𝑡𝑛𝑒𝑠𝑠𝑡𝑜𝑡𝑎𝑙(𝑃 (𝑖) ) = ∑_{𝑗=1}^{𝑝𝑜𝑝𝑆𝑖𝑧𝑒} 𝑓 𝑖𝑡𝑛𝑒𝑠𝑠𝑃 (𝑖)𝑗 . Finally, the average fitness 𝑓 𝑖𝑡𝑛𝑒𝑠𝑠𝑖 = 𝑓 𝑖𝑡𝑛𝑒𝑠𝑠𝑡𝑜𝑡𝑎𝑙(𝑃 (𝑖) )∕𝑝𝑜𝑝𝑆𝑖𝑧𝑒 is calculated. The fitness value is the reciprocal of the error obtained between the predicted value and the actual value according to the selected error measure function.

3.2.2. AMFS–BSA
Parameter setting also plays an important role in the performance of EAs. The mutation factor 𝐹 , which appears in the mutation operator of BSA, determines the search direction as the algorithm iterates by controlling the amplitude of the search-direction matrix (𝑜𝑙𝑑𝑃 − 𝑃 ). A mutation factor 𝐹 with a large value is effective for global search but yields a low-accuracy global optimal solution, whereas a low 𝐹 value can accelerate convergence but cannot ensure population diversity. The performance of BSA is thus sensitive to 𝐹 because it balances search capability and convergence. Given the importance of the mutation factor 𝐹 , we propose another BSA variant, AMFS–BSA, which updates the mutation factor 𝐹 in each generation in accordance with


AMFS instead of keeping 𝐹 constant. AMFS is inspired by ADE (Wang et al., 2015b) and is shown in Eq. (18). In the initial iterations, a large 𝐹 is adopted to guarantee the population diversity of BSA, whereas a low value of 𝐹 is used as the algorithm iterates, thereby preventing superior individuals from being discarded. This strategy simultaneously considers the exploration and exploitation of BSA and tries to balance them:

𝐹 = 𝐹min + (𝐹max − 𝐹min ) × 𝑒^(1 − 𝐺𝑒𝑛𝑀∕(𝐺𝑒𝑛𝑀 − 𝐺 + 1)) , (18)

where 𝐹max and 𝐹min denote the maximum and minimum values of the mutation factor 𝐹 , respectively. 𝐺𝑒𝑛𝑀 and 𝐺 are the maximum iteration number and the present iteration number, respectively.

Table 1
Parameters used in the current study.

ESN parameters:
𝑁 — Reservoir size
𝑆𝑃 — Sparseness
𝑆𝑅 — Spectral radius
𝑊 𝐿 — Washout length

BSA parameters:
𝑝𝑜𝑝𝑆𝑖𝑧𝑒 — Population size
𝐺𝑒𝑛𝑀 — Maximum iteration number
𝐹 — Mutation factor
𝐹min — Minimum mutation factor
𝐹max — Maximum mutation factor
[𝑙𝑜𝑤𝑗 , 𝑢𝑝𝑗 ], 𝑗 = 1, 2, … , 𝐷 — Gene ranges
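The AMFS schedule of Eq. (18) transcribes directly; the bounds below are illustrative defaults, not values reported by the authors.

```python
import math

def amfs_factor(G, GenM, F_min=0.1, F_max=3.0):
    # Eq. (18): F = F_min + (F_max - F_min) * e^(1 - GenM / (GenM - G + 1))
    return F_min + (F_max - F_min) * math.exp(1.0 - GenM / (GenM - G + 1))
```

At G = 1 the exponent is 0, so F starts at F_max; as G approaches GenM the exponent becomes strongly negative and F decays toward F_min, matching the exploration-then-exploitation intent of AMFS.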

3.2.3. APSS&AMFS–BSA
The third BSA variant is APSS&AMFS–BSA, which simultaneously applies APSS and AMFS to replace the corresponding strategies in standard BSA. APSS and AMFS are combined based on the consideration that BSA hinges on both the population selection strategy and the control parameter setting, and the two strategies may reinforce each other. Thus, APSS&AMFS–BSA should help enhance search capability and accelerate convergence; in other words, APSS&AMFS could help balance the exploration and exploitation of BSA. Although BSA is good at dealing with complex optimization problems, its performance can still be improved. The above variants of BSA are proposed to enhance its performance and are likely to promote its explorative behavior in a controlled manner.

𝐺 = 𝐺𝑒𝑛𝑀), is satisfied. The best individual is then obtained; otherwise, carry out the next step.
Step 4: Generate the trial population 𝑇 according to the selection-I, mutation, and crossover operators.
Step 5: Evaluate the fitness values of the trial population 𝑇 . Update the population 𝑃 according to the selection-II operator. Compare the best individual in the present iteration with the obtained global optimal individual, and set the better one as the present global best individual.
Step 6: Let 𝐺 = 𝐺 + 1, then go back to Step 3.
Step 7: The best individual obtained from BSA is designated as the optimal output weight of ESN. Thus, the ESN is well trained.
Step 8: The best-fitting ESN is fed with the testing inputs to forecast the testing outputs.
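Steps 1–8 amount to wrapping an iterative optimizer around the fixed reservoir. The following is a hedged, generic sketch: the helper names and the trial-generation callback are ours, standing in for the BSA operators, and the fitness is the reciprocal-error measure (higher is better).

```python
import numpy as np

rng = np.random.default_rng(3)

def optimize_output_weights(fitness, make_trial, pop_size, D, low, up, GenM):
    """Steps 2-7: evolve candidate output-weight vectors, return the best."""
    P = rng.uniform(low, up, (pop_size, D))   # Step 2: initialize population
    fit = fitness(P)
    best = P[np.argmax(fit)].copy()           # global best (fitness = 1/error)
    best_fit = fit.max()
    for _ in range(GenM):                     # Steps 3/6: iterate until G = GenM
        T = np.clip(make_trial(P), low, up)   # Step 4: trial population
        fT = fitness(T)                       # Step 5: selection-II update
        improved = fT > fit
        P[improved], fit[improved] = T[improved], fT[improved]
        if fit.max() > best_fit:
            best_fit = fit.max()
            best = P[np.argmax(fit)].copy()
    return best                               # Step 7: optimal output weights
```

Step 8 then plugs the returned vector into the readout of Eq. (5) on the testing inputs.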

4. Regulation of ESN based on BSA and its variants

In our simulations, we concentrate on ESN for forecasting. The output weights of ESN are either optimized with basic BSA or one of the BSA variants, or not optimized at all. We therefore propose the following forecasting models.

5. Numerical examples and result analysis

(1) ESN: standard ESN using Eq. (9) to calculate the output weights, (2) BSA–ESN: ESN that applies basic BSA instead of Eq. (9) to optimize the output weights, (3) APSS–BSA–ESN: ESN that applies APSS–BSA to optimize the output weights, (4) AMFS–BSA–ESN: ESN that applies AMFS–BSA to optimize the output weights, and (5) APSS&AMFS–BSA–ESN: ESN that applies APSS&AMFS–BSA to optimize the output weights.

5.1. Experimental setup

In this section, we provide a comprehensive experimental evaluation of the five proposed forecasting models using two real-life time series benchmarks. In addition, we design two further models for comparison: BSA applied to optimize a back-propagation neural network (BSA–BPNN) and a genetic algorithm (GA) applied to optimize the ESN (GA–ESN). All experiments were implemented using MATLAB 2016b and performed on a personal computer with an Intel(R) Core(TM) i7-4790K CPU @ 4.00 GHz, 8 GB main memory, and the Windows 7 Professional operating system.

Of the above models, models (2), (3), (4), and (5) are optimized with basic BSA or its variants. BSA and its variants are designed to optimize the output weight matrix Wout of ESN. The dimension of each individual, denoted as 𝐷, is identical to the reservoir size of ESN in the general scenario in which the number of output units is set to 1. A group of individuals is obtained from each BSA iteration. An output value 𝑦̂𝑡 (𝑡 = 1, 2, … , 𝑘), where 𝑘 is the number of predictive values, can be calculated according to each individual. The fitness value is the reciprocal of the error between the output value 𝑦̂𝑡 and the actual value 𝑦𝑡 according to the selected error measure function. A flowchart showing the optimization of the output weights of ESN through BSA (basic BSA or its variants) is shown in Fig. 3. The optimization procedure is as follows:
Step 1: Set parameters and collect the state matrix of the training inputs. All the parameters used in this study are shown in Table 1 and are initialized. The state matrix of the training inputs is collected according to Eqs. (4) and (6).
Step 2: Initialize the populations 𝑃 and 𝑜𝑙𝑑𝑃 by using Eqs. (10) and (11). Set 𝐺 = 0.
Step 3: Judge whether the BSA iteration should be terminated. The BSA iteration is stopped if the termination condition (i.e., the equation

5.1.1. Data sets
Two real-life time series benchmarks are used to test the effectiveness of the seven forecasting models. The two time series exhibit nonlinear properties, such as fluctuations and cyclical trends, and thus have often been chosen to verify the performance of various nonlinear models for time series forecasting (Zhang, 2003; Ghiassi and Saidane, 2005; Khashei and Bijari, 2012; Adhikari, 2015). The primary characteristic of ANNs, including ESN, is their strong nonlinear modeling capacity; the two time series are therefore well suited to testing the effectiveness of the ESN and the optimized ESNs. Concretely, the first time series is the Canadian lynx time series, a record of the lynx trapped per year in the Mackenzie River District of northern Canada from 1821 to 1934 (Campbell and Walker, 1977). The second is the sunspot time series, a well-known record of the annual visible spots observed on the face of the sun from 1700 to 1987 (Cottrell et al., 1995).


Fig. 3. Optimization of the output weights of ESN through BSA (basic BSA or its variants).
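To make the fitness evaluation in this loop concrete, here is a minimal sketch of how a candidate output-weight vector could be scored; the names (reservoir_states, targets) are illustrative and not from the paper, and MSE is used as the error measure:

```python
import numpy as np

def esn_outputs(w_out, reservoir_states):
    # One predicted output per collected reservoir state (one output unit,
    # so w_out has one entry per reservoir neuron, i.e., dimension D).
    return reservoir_states @ w_out

def fitness(w_out, reservoir_states, targets):
    # Fitness is the reciprocal of the error between the predicted outputs
    # and the actual values, as described in the text.
    y_hat = esn_outputs(w_out, reservoir_states)
    mse = float(np.mean((y_hat - targets) ** 2))
    return 1.0 / mse if mse > 0 else float("inf")
```

Each BSA individual is one such candidate `w_out`; the population is ranked by this fitness at every generation.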

5.1.2. Data preprocessing
Raw data sets are processed at the beginning of the experiments to improve the performances of the seven proposed forecasting models. The data sets are normalized to fit in [0, 1] by adopting the linear transformation in Eq. (19):

𝑥 = (𝑥 − 𝑥min) / (𝑥max − 𝑥min),  (19)

where 𝑥max and 𝑥min denote the maximum and minimum values of the data set, respectively. As is done in the literature (Adhikari, 2015; Wang et al., 2015b), the Canadian lynx data are first transformed using logarithms (to the base of 10) to achieve a good fit and then normalized. This preprocessing prevents attributes in wide numerical ranges from dominating those in small numerical ranges.
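A minimal sketch of this preprocessing step (function names are ours, not from the paper):

```python
import math

def min_max_normalize(series):
    # Eq. (19): x' = (x - x_min) / (x_max - x_min), mapping the series into [0, 1].
    x_min, x_max = min(series), max(series)
    return [(x - x_min) / (x_max - x_min) for x in series]

def preprocess_lynx(raw_counts):
    # For the Canadian lynx data, a base-10 logarithm is applied first,
    # then the log series is normalized.
    logs = [math.log10(x) for x in raw_counts]
    return min_max_normalize(logs)
```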

5.1.3. Performance estimation
Several measurements can be employed to assess the performance of a time series forecasting model. In this study, forecasting performance is evaluated with three evaluation metrics frequently employed in the literature: MSE, MAD, and MAE (Zhang, 2003; Ghiassi and Saidane, 2005; Khashei and Bijari, 2012; Adhikari, 2015). They are defined as follows:

MSE = (1/𝑘) Σ_{𝑡=1}^{𝑘} (𝑦̂𝑡 − 𝑦𝑡)²,  (20)

MAD = (1/𝑘) Σ_{𝑡=1}^{𝑘} |𝑦̂𝑡 − 𝑦̂|,  (21)

MAE = (1/𝑘) Σ_{𝑡=1}^{𝑘} |𝑦̂𝑡 − 𝑦𝑡|,  (22)

where 𝑦̂𝑡 and 𝑦𝑡 denote the predictive value and the actual value, respectively, and 𝑘 is the number of predictive values. 𝑦̂ denotes the average value of the predictive values, calculated by 𝑦̂ = (1/𝑘) Σ_{𝑡=1}^{𝑘} 𝑦̂𝑡.

5.1.4. Parameter setting
The performance of the proposed approach primarily depends on the performance of the adopted ESN. In turn, ESN performance is influenced by the selection of its structure and corresponding parameters. For our one-step-ahead forecasting tasks, the ESN is designed with one unit in the output layer. The numbers of input units and hidden units (i.e., the reservoir size) are selected from the sets {1, 2, … , 10} and {200, 300, 500}, respectively. Thus, 300 trial candidates are investigated for each of the six ESN and optimized ESN models, and the optimal candidate is selected as the final determinate model. The other corresponding parameters, such as the reservoir sparseness (𝑆𝑃 ) and spectral


Table 2
Twenty benchmark functions.

F1   f(x) = Σ_{i=1}^{n} i·x_i²                                                      Range (−5.12, 5.12)   x* = 0    f(x*) = 0
F2   f(x) = Σ_{i=2}^{n} i·(2x_i² − x_{i−1})² + (x_1 − 1)²                           Range (−10, 10)       x* = 0    f(x*) = 0
F3   f(x) = −exp(−0.5 Σ_{i=1}^{n} x_i²)                                             Range (−1, 1)         x* = 0    f(x*) = −1
F4   f(x) = Σ_{i=1}^{n} (10⁶)^((i−1)/(n−1)) x_i²                                    Range (−100, 100)     x* = 0    f(x*) = 0
F5   f(x) = Σ_{i=1}^{n} (Σ_{j=1}^{i} x_j)²                                          Range (−100, 100)     x* = 0    f(x*) = 0
F6   f(x) = max_i |x_i|                                                             Range (−100, 100)     x* = 0    f(x*) = 0
F7   f(x) = Σ_{i=1}^{n} |x_i| + Π_{i=1}^{n} |x_i|                                   Range (−10, 10)       x* = 0    f(x*) = 0
F8   f(x) = Σ_{i=1}^{n} x_i²                                                        Range (−100, 100)     x* = 0    f(x*) = 0
F9   f(x) = Σ_{i=1}^{n} ⌊x_i + 0.5⌋²                                                Range (−100, 100)     x* = 0    f(x*) = 0
F10  f(x) = −20 exp(−0.2 √((1/n) Σ_{i=1}^{n} x_i²)) − exp((1/n) Σ_{i=1}^{n} cos(2πx_i)) + 20 + e    Range (−32, 32)   x* = 0   f(x*) = 0
F11  f(x) = Σ_{i=1}^{n} |x_i sin(x_i) + 0.1x_i|                                     Range (−10, 10)       x* = 0    f(x*) = 0
F12  f(x) = f1(x_1, x_2) + f1(x_2, x_3) + ⋯ + f1(x_n, x_1), with f1(x, y) = 0.5 + (sin²(√(x² + y²)) − 0.5)/(1 + 0.001(x² + y²))²    Range (−100, 100)   x* = 0   f(x*) = 0
F13  f(x) = (π/n){10 sin²(πy_1) + Σ_{i=1}^{n−1} (y_i − 1)²[1 + 10 sin²(πy_{i+1})] + (y_n − 1)²} + Σ_{i=1}^{n} μ(x_i, 10, 100, 4), with y_i = 1 + 0.25(x_i + 1) and μ(x_i, a, k, m) = k(x_i − a)^m if x_i > a; 0 if −a < x_i < a; k(−x_i − a)^m if x_i < −a    Range (−50, 50)   x* = −1   f(x*) = 0
F14  f(x) = (1/4000) Σ_{i=1}^{n} x_i² − Π_{i=1}^{n} cos(x_i/√i) + 1                 Range (−600, 600)     x* = 0    f(x*) = 0
F15  f(x) = −Σ_{i=1}^{n−1} exp(−(x_i² + x_{i+1}² + 0.5x_i x_{i+1})/8) · cos(4 √(x_i² + x_{i+1}² + 0.5x_i x_{i+1}))    Range (−5, 5)   x* = 0   f(x*) = 1 − n
F16  f(x) = f1(x_1, x_2) + f1(x_2, x_3) + ⋯ + f1(x_n, x_1), with f1(x, y) = (x² + y²)^0.25 [sin²(50(x² + y²)^0.1) + 1]    Range (−32, 32)   x* = 0   f(x*) = 0
F17  f(x) = Σ_{i=1}^{n−1} (0.5 + (sin²(√(100x_i² + x_{i+1}²)) − 0.5)/(1 + 0.001(x_i² − 2x_i x_{i+1} + x_{i+1}²))²)    Range (−100, 100)   x* = 0   f(x*) = 0
F18  f(x) = Σ_{i=1}^{n} (x_i² − 10 cos(2πx_i) + 10)                                 Range (−5.12, 5.12)   x* = 0    f(x*) = 0
F19  f(y) = Σ_{i=1}^{n} (y_i² − 10 cos(2πy_i) + 10), with y_i = x_i if |x_i| < 1/2 and y_i = round(2x_i)/2 if |x_i| ≥ 1/2    Range (−5.12, 5.12)   x* = 0   f(x*) = 0
F20  f(x) = 1 − cos(2π √(Σ_{i=1}^{n} x_i²)) + 0.1 √(Σ_{i=1}^{n} x_i²)               Range (−100, 100)     x* = 0    f(x*) = 0

radius (𝑆𝑅) of the internal connection matrix W, are prepared in accordance with the study of Jaeger (2001). In the current simulation, the numbers of the input and hidden units are set as mentioned earlier. The sparseness (𝑆𝑃 ) of W is set to 5%, and the spectral radius (𝑆𝑅) is set to 0.8. The washout length (𝑊 𝐿) is set to 0. The connection weight matrices W and Win are randomly fixed before training and remain unchanged; all their elements are specified within the range [−1, 1].
The parameters used in BSA also play important roles in the performance of the proposed approach. These parameters are set on the basis of a series of experiments or in accordance with the existing literature. Concretely, they are specified as follows: (a) the population size is set to 𝑝𝑜𝑝𝑆𝑖𝑧𝑒 = 100; (b) the maximum iteration is set to 𝐺𝑒𝑛𝑀 = 100; (c) the mix rate parameter is set to 𝑚𝑖𝑥𝑟𝑎𝑡𝑒 = 1.0 (Civicioglu, 2013); and (d) the original mutation factor 𝐹 is set to 𝐹 = 3 ∙ 𝑟𝑎𝑛𝑑𝑛 with 𝑟𝑎𝑛𝑑𝑛 ∼ 𝑁(0, 1), where 𝑁(0, 1) is the standard normal distribution (Civicioglu, 2013). The maximum and minimum values of the mutation factor 𝐹 are set to 𝐹max = 0.9 and 𝐹min = 0.1, respectively. A more detailed experiment is described in Section 5.2 to show how these parameters are determined.
The parameters used in the BPNN and GA are also important for the performance of the comparison approaches. According to the experience of Wang et al. (2015a,b) and the necessary experiments, they are set as follows. For the BPNN model, the number of hidden layers is set to 1, and the numbers of input units and hidden units are selected from the sets {1, 2, … , 10} and {10, 20, 30}, respectively. For the metaheuristic GA, the parameters are set according to a series of experiments. More

precisely, they are set as follows: (a) the population size is set to 100; (b) the maximum iteration is set to 100; (c) the crossover rate is set to 0.6; and (d) the mutation rate is set to 0.01.

5.1.5. Performance measures of BSA and its three variants
To verify the performance of BSA and its three variants, namely, adaptive population selection scheme (APSS)-BSA, adaptive mutation factor strategy (AMFS)-BSA, and APSS&AMFS–BSA, twenty benchmark functions, including multimodal and unimodal ones, are selected. These functions are widely used in the related literature (Pan et al., 2014; Wang et al., 2015a; Lv et al., 2018b) and are shown in Table 2. In our benchmark function studies, each benchmark problem is run for twenty independent replications of APSS&AMFS–BSA, APSS-BSA, AMFS–BSA, and BSA, with the dimension set to 30. Then, APSS&AMFS–BSA is compared with APSS-BSA, AMFS–BSA, BSA, and the hybrid fruit fly optimization algorithm (HFOA). HFOA is selected because the related literature has shown that it is an outstanding approach for these benchmark functions (Lv et al., 2018b). The median and standard deviation (Std.) over the twenty independent replications are calculated as performance measures. The results are shown in Table 3, with the best ones recorded in boldface. From Table 3, we can conclude that APSS&AMFS–BSA performs no worse than BSA, APSS-BSA, AMFS–BSA, and HFOA in nineteen, sixteen, eighteen, and twelve, respectively, of the twenty benchmark functions in terms of the median measure. Hence, APSS&AMFS–BSA is superior to BSA, APSS-BSA, and AMFS–BSA on these widely used benchmark functions.
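As an illustration, two of the Table 2 functions (F8, the sphere function, and F10, the Ackley function) can be implemented as follows; the Python function names are ours:

```python
import math

def sphere(x):
    # F8: f(x) = sum of x_i^2, with minimum 0 at the origin.
    return sum(v * v for v in x)

def ackley(x):
    # F10: minimum 0 at the origin on the range (-32, 32)^n.
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(2 * math.pi * v) for v in x) / n
    return -20 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20 + math.e
```

At the dimension n = 30 used in the experiments, both functions return 0 at x = 0, matching the f(x*) column of Table 2.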


Table 3
Results of five algorithms for benchmark functions (each cell: Median / Std. over twenty replications).

Function  HFOA                  BSA                   APSS-BSA              AMFS–BSA              APSS&AMFS–BSA
1         1.69E−04 / 1.04E−05   7.18E−03 / 3.13E−03   4.68E−03 / 4.73E−04   4.09E−04 / 3.14E−05   3.26E−05 / 8.60E−07
2         1.76E−01 / 2.20E−02   2.97E−01 / 3.53E−02   1.50E−01 / 2.03E−02   1.92E−01 / 3.42E−02   9.95E−02 / 1.89E−02
3         −1.00E+00 / 4.34E−07  −9.98E−01 / 5.11E−04  −1.00E+00 / 4.27E−04  −1.00E+00 / 5.59E−04  −1.00E+00 / 7.01E−05
4         2.88E−02 / 4.16E−03   1.92E+00 / 2.43E+01   1.98E−02 / 7.76E−02   1.01E−02 / 7.87E−02   2.81E−02 / 6.11E−02
5         2.61E−03 / 2.16E−04   5.82E−02 / 5.37E−03   7.72E−02 / 4.51E−02   1.42E−01 / 6.42E−01   6.05E−02 / 3.54E−02
6         1.05E−03 / 4.85E−05   3.85E−03 / 1.63E−03   2.57E−03 / 8.74E−04   1.63E−03 / 3.65E−04   1.55E−03 / 7.47E−05
7         1.82E−02 / 5.45E−04   9.08E−03 / 1.04E−04   2.61E−02 / 4.21E−04   1.08E−02 / 5.41E−06   8.58E−03 / 4.19E−05
8         1.30E−05 / 6.37E−07   1.01E−03 / 3.51E−05   5.37E−04 / 4.66E−04   4.40E−05 / 6.82E−05   2.64E−05 / 5.12E−06
9         0 / 0                 0.16 / 0.231          0 / 0                 0 / 0                 0 / 0
10        2.65E−03 / 5.67E−05   2.56E−02 / 7.39E−03   5.51E−03 / 4.87E−04   1.01E−02 / 5.79E−03   3.41E−03 / 4.71E−05
11        1.84E−03 / 3.91E−05   6.18E−04 / 2.20E−06   8.73E−04 / 6.22E−05   5.91E−04 / 4.13E−05   4.18E−04 / 4.93E−06
12        2.56E−05 / 1.54E−06   1.45E−04 / 4.51E−04   8.01E−06 / 3.43E−07   3.80E−05 / 4.00E−05   9.47E−06 / 2.98E−07
13        3.06E−01 / 6.16E−04   3.33E−01 / 4.55E−03   2.29E−01 / 6.21E−04   3.54E−01 / 1.64E−02   2.33E−01 / 4.06E−03
14        5.32E−07 / 4.22E−08   3.02E−05 / 1.69E−05   1.88E−05 / 1.89E−05   5.93E−07 / 6.88E−08   4.36E−05 / 2.91E−05
15        −2.90E+01 / 1.45E−05  −2.90E+01 / 5.31E−03  −2.88E+01 / 4.13E−02  −2.90E+01 / 6.60E−03  −2.90E+01 / 7.70E−03
16        1.05E+00 / 2.01E−02   5.89E+00 / 6.81E−01   7.65E+00 / 1.51E−01   5.98E+00 / 4.33E−01   5.41E+00 / 6.01E−01
17        6.36E−08 / 7.13E−09   1.57E−01 / 1.73E−02   2.29E−03 / 1.32E−03   1.53E−05 / 5.85E−04   3.18E−06 / 5.50E−06
18        2.60E−03 / 1.34E−04   4.56E−01 / 2.21E−01   3.13E−01 / 2.70E−01   3.34E−01 / 1.78E−01   7.69E−02 / 4.08E−02
19        2.53E−03 / 1.58E−04   8.12E−03 / 6.03E−04   6.34E−03 / 5.29E−04   3.48E−03 / 8.54E−06   2.09E−03 / 3.32E−05
20        1.28E−04 / 1.20E−04   5.46E−05 / 1.43E−04   5.38E−06 / 1.86E−06   7.64E−06 / 3.50E−06   3.22E−06 / 1.89E−06

5.2. Comparative experiment 1: Canadian lynx series

The Canadian lynx series contains 114 annual observations, as shown in Fig. 4. In the existing studies (Zhang, 2003; Khashei and Bijari, 2012; Adhikari, 2015; Wang et al., 2015b), the first 100 data entries (87.7%, 1821–1920) are usually partitioned as the training data set, and the remaining 14 data entries (12.3%, 1921–1934) are used as the test data set. We adopt a modified version of the same partitioning of the data set: the last 14 data entries of the training data set (the same size as the test data set) are used as the validation data set for determining the best of the 300 trial models for each forecasting model. Similar to the studies of Zhang (2003), Khashei and Bijari (2012), Adhikari (2015), and Wang et al. (2015b), both the evaluation metrics MSE and MAE are employed to evaluate the performance of our proposed forecasting models, and MSE is selected as the fitness function for BSA.

Table 4
Optimal modeling and training information for the seven forecasting models.

Models                 Training error (MSE)   Validation error (MSE)   Model structure
ESN                    3.7438 × 10⁻²¹         0.1062                   8 × 300 × 1
BSA-BPNN               0.0626                 0.0913                   3 × 20 × 1
GA-ESN                 0.0685                 0.0956                   2 × 200 × 1
BSA–ESN                0.0668                 0.0943                   1 × 200 × 1
APSS–BSA–ESN           0.0636                 0.0956                   1 × 500 × 1
AMFS–BSA–ESN           0.0632                 0.0929                   1 × 300 × 1
APSS&AMFS–BSA–ESN      0.0517                 0.0860                   2 × 200 × 1

Three hundred trial candidates are investigated for each of the seven forecasting models during the training process, as mentioned above, and the optimal candidate is selected as the final determinate model. Table 4 shows the optimal modeling and training information for the seven forecasting models.

To validate the effectiveness of the parameter settings in BSA, a series of experiments is designed based on the four optimized ESNs, namely, BSA–ESN, APSS–BSA–ESN, AMFS–BSA–ESN, and APSS&AMFS–BSA–ESN. These experiments can be divided into two sets. The first set is used to validate the effectiveness of the population size (𝑝𝑜𝑝𝑆𝑖𝑧𝑒 = 100) and the maximum iteration (𝐺𝑒𝑛𝑀 = 100). It is based on the premise that the population size is set to 𝑝𝑜𝑝𝑆𝑖𝑧𝑒 = 100, the maximum iteration is set to 𝐺𝑒𝑛𝑀 = 100, and the maximum and minimum values of the mutation factor 𝐹 are set to 𝐹max = 0.9 and 𝐹min = 0.1, respectively. Fig. 5 shows the


Fig. 4. Canadian lynx time series (1821–1934).

Table 5
Optimal validation error of the four optimized ESNs with different values of 𝐹max and 𝐹min (rows: 𝐹min; columns: 𝐹max).

BSA-ESN
𝐹min \ 𝐹max   0.6      0.7      0.8      0.9      1.0
0             0.1001   0.1128   0.1321   0.1198   0.0988
0.1           0.1156   0.1350   0.0994   0.0943   0.0965
0.2           0.1298   0.1052   0.1011   0.1111   0.1203
0.3           0.0989   0.1122   0.1096   0.1173   0.1019
0.4           0.1137   0.1251   0.1055   0.1369   0.1120

APSS–BSA–ESN
𝐹min \ 𝐹max   0.6      0.7      0.8      0.9      1.0
0             0.1106   0.0971   0.1046   0.1171   0.1244
0.1           0.1099   0.1163   0.0945   0.0956   0.0991
0.2           0.1231   0.1100   0.1051   0.1062   0.1147
0.3           0.0994   0.1045   0.1077   0.1115   0.1108
0.4           0.1094   0.1021   0.1002   0.1105   0.1216

AMFS–BSA–ESN
𝐹min \ 𝐹max   0.6      0.7      0.8      0.9      1.0
0             0.1096   0.1071   0.1117   0.1172   0.1234
0.1           0.1007   0.1101   0.0979   0.0929   0.0939
0.2           0.1091   0.1174   0.1117   0.1172   0.1234
0.3           0.1047   0.1087   0.0973   0.0978   0.1049
0.4           0.0999   0.1009   0.1044   0.1081   0.1102

APSS&AMFS–BSA–ESN
𝐹min \ 𝐹max   0.6      0.7      0.8      0.9      1.0
0             0.1008   0.1075   0.0933   0.0938   0.0937
0.1           0.1057   0.0981   0.0901   0.0860   0.0911
0.2           0.0966   0.0946   0.0941   0.0951   0.1048
0.3           0.0935   0.1053   0.0977   0.0916   0.1088
0.4           0.0975   0.1103   0.0968   0.0925   0.1003

optimization processes of BSA and the BSA variants for the output weights of ESN. It can be learned from Fig. 5 that all four optimized ESNs reach a relatively optimal iterative MSE in fewer than 90 generations. The second set of experiments is applied to validate the effectiveness of the maximum and minimum values of the mutation factor 𝐹, with 𝐹max varied from 0.6 to 1.0 in steps of 0.1 and 𝐹min from 0 to 0.4 in steps of 0.1. The premise of these experiments is that the population size is set to 𝑝𝑜𝑝𝑆𝑖𝑧𝑒 = 100 and the maximum iteration is set to 𝐺𝑒𝑛𝑀 = 100. The results of these experiments are shown in Table 5, with the best result for each group recorded in boldface. It is clear that better validation results are obtained in 3 of the 4 cases when 𝐹max is set to 0.9 and 𝐹min is set to 0.1. Thus, the BSA parameters are set as mentioned in Section 5.1.4 and are kept unchanged for the current and later comparative experiments.

The actual values of the test logarithmic Canadian lynx time series and the results acquired from the seven proposed forecasting models are shown in Table 6 and plotted in Fig. 6. We apply MSE and MAE as the test error measurements, as shown in the last two rows of Table 6, with the best results of the proposed models shown in boldface. Using Tables 4 and 6 to compare the performance of the basic ESN with those of the four optimized ESNs, we find that the basic ESN obtains a smaller training error but produces a larger test error than the four optimized ESNs; the latter obtain smaller test errors but relatively larger training errors. We attribute this phenomenon to overfitting of the basic ESN during the training process. In other words, the output weights of the basic ESN obtained through a linear regression algorithm may fit the training data too well, thereby resulting in overfitting. The four optimized ESNs perform better than the basic ESN and seem to have no overfitting problem. This is why we apply BSA (basic BSA and its variants) to optimize the output weights of ESN.

For a comparison of two methods, we infer relative performance from the relative error (Wang et al., 2016b). For two given methods 𝐴 (base method) and 𝐵 (comparison method) with errors 𝛿𝐴 and 𝛿𝐵, the relative error is (𝛿𝐵 − 𝛿𝐴)/𝛿𝐴. Relative error is a popular measure used to identify the method with superior performance when two methods are compared; the lower the relative error, the better the comparison method performs relative to the base method. In this study, the proposed APSS&AMFS–BSA–ESN model is set as the comparison method, whereas the ESN, BSA–ESN, APSS–BSA–ESN, AMFS–BSA–ESN, BSA-BPNN, GA-ESN, and other existing forecasting models are set as the base methods. Table 7 shows the relative errors of the pairs of models being compared in terms of MSE.

The relative errors in Table 7 show that the proposed APSS&AMFS–BSA–ESN model is superior to the four other proposed models (ESN, BSA–ESN, APSS–BSA–ESN, and AMFS–BSA–ESN), the two comparison models, and the other existing models. (a) The forecasting performance of the APSS&AMFS–BSA–ESN model is better than those of the six other proposed models, with MSE improvements of 82.18%, 48.64%, 73.07%, 13.11%, 27.30%, and 21.56%. Hence, the proposed APSS&AMFS–BSA–ESN model can achieve better forecasting accuracy than the six other proposed models. This information is shown as ''Relative error (MSE)'' in the last column of Table 7. We perform nonparametric statistical tests on the experimental results using the absolute error measure to further determine whether the performance of the proposed APSS&AMFS–BSA–ESN model is significantly different from those of the six other proposed models. The


Fig. 5. The iterative MSE trend of the four proposed optimized ESN models (for the Canadian lynx series).

Table 6
Actual values of the test logarithmic Canadian lynx data and forecasting results of the seven proposed models.

Years  Actual   ESN       BSA-BPNN  GA-ESN    BSA–ESN   APSS–BSA–ESN  AMFS–BSA–ESN  APSS&AMFS–BSA–ESN
1921   2.3598   2.1894    2.2068    2.3815    2.2547    2.3734        2.3312        2.3045
1922   2.6010   2.8501    2.4503    2.7252    2.6023    2.6840        2.6415        2.6769
1923   3.0538   2.7198    2.8801    2.8012    3.0515    2.9637        2.9164        2.9237
1924   3.3860   3.2697    3.4545    3.5056    3.4300    3.4535        3.3744        3.4134
1925   3.5532   3.6190    3.5958    3.6069    3.5935    3.5772        3.5542        3.5460
1926   3.4676   3.1626    3.3073    3.2082    3.3574    3.4436        3.4595        3.4489
1927   3.1867   3.2271    3.1299    3.0956    3.0891    3.1691        3.1336        3.1099
1928   2.7235   2.9164    2.8823    2.9006    2.8634    2.9311        2.8002        2.8197
1929   2.6857   2.4376    2.7086    2.4549    2.6653    2.6306        2.4950        2.5392
1930   2.8209   2.8409    2.8213    2.8482    2.7334    2.8255        2.7337        2.8401
1931   3.0000   2.9387    2.9512    3.1164    3.0813    3.1346        3.0009        3.0930
1932   3.2014   3.2543    3.2588    3.2775    3.2158    3.2624        3.1719        3.2635
1933   3.4244   3.3586    3.3951    3.3648    3.3404    3.3309        3.3034        3.3190
1934   3.5310   3.4015    3.4343    3.4229    3.4176    3.4291        3.4276        3.4959
MSE    –        0.031576  0.010956  0.020893  0.006476  0.007740      0.007174      0.005627
MAE    –        0.146553  0.087145  0.122707  0.067277  0.069878      0.063557      0.062648

Fig. 6. Actual values and forecasting results of the seven proposed models of test logarithmic Canadian lynx series.


Z.G. Wang, Y.-R. Zeng, S.R. Wang et al.

Engineering Applications of Artificial Intelligence 81 (2019) 117–132

Table 7
Relative errors of the pairs of models being compared.

Models                                       MSE       MAE       Relative error (MSE)
Existing models
  Zhang's hybrid ARIMA/ANNs                  0.017233  0.103972  −67.35%
  Khashei's ANN/PNN                          0.014872  0.079628  −62.16%
  Khashei's ARIMA/PNN                        0.011461  0.084381  −50.90%
  Wang's ADE–BPNN                            0.010392  0.070623  −45.85%
  Adhikari's linear combination method       0.006     0.068     −6.22%
Proposed models
  ESN                                        0.031576  0.146553  −82.18%
  BSA-BPNN                                   0.010956  0.087145  −48.64%
  GA-ESN                                     0.020893  0.122707  −73.07%
  BSA–ESN                                    0.006476  0.067277  −13.11%
  APSS–BSA–ESN                               0.007740  0.069878  −27.30%
  AMFS–BSA–ESN                               0.007174  0.063557  −21.56%
  APSS&AMFS–BSA–ESN                          0.005627  0.062648  0
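The ''Relative error (MSE)'' column of Table 7 follows the definition (𝛿𝐵 − 𝛿𝐴)/𝛿𝐴 given in the text. A minimal sketch, checked against the ESN and ADE–BPNN rows:

```python
def relative_error(delta_base, delta_comp):
    # (delta_B - delta_A) / delta_A; negative values mean the comparison
    # method outperforms the base method.
    return (delta_comp - delta_base) / delta_base

# APSS&AMFS-BSA-ESN (comparison, MSE = 0.005627) versus the basic ESN
# (base, MSE = 0.031576) gives about -0.8218, i.e., the -82.18% in Table 7.
improvement_vs_esn = relative_error(0.031576, 0.005627)
```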

Table 8
Absolute errors between the predicted values of the seven proposed models and the actual values.

Years  ESN     BSA-BPNN  GA-ESN  BSA–ESN  APSS–BSA–ESN  AMFS–BSA–ESN  APSS&AMFS–BSA–ESN
1921   0.1704  0.1530    0.0217  0.1051   0.0136        0.0286        0.0553
1922   0.2491  0.1507    0.1242  0.0013   0.0830        0.0405        0.0759
1923   0.3340  0.1737    0.2526  0.0023   0.0901        0.1374        0.1301
1924   0.1163  0.0685    0.1196  0.0440   0.0675        0.0116        0.0274
1925   0.0658  0.0426    0.0537  0.0403   0.0240        0.0010        0.0072
1926   0.3050  0.1603    0.2594  0.1102   0.0240        0.0081        0.0187
1927   0.0404  0.0568    0.0911  0.0976   0.0176        0.0531        0.0768
1928   0.1929  0.1588    0.1771  0.1399   0.2076        0.0767        0.0962
1929   0.2481  0.0229    0.2308  0.0204   0.0551        0.1907        0.1465
1930   0.0200  0.0004    0.0273  0.0875   0.0046        0.0872        0.0192
1931   0.0613  0.0488    0.1164  0.0813   0.1346        0.0009        0.0930
1932   0.0529  0.0574    0.0761  0.0144   0.0610        0.0295        0.0621
1933   0.0658  0.0293    0.0596  0.0840   0.0935        0.1210        0.1054
1934   0.1295  0.0967    0.1081  0.1134   0.1019        0.1034        0.0351

Table 9
Wilcoxon signed-rank test.

Compared methods                                 R+   R−   p-value   Diff?
APSS&AMFS–BSA–ESN versus ESN                     14   91   0.0157    Yes
APSS&AMFS–BSA–ESN versus BSA-BPNN                68   37   0.331     No
APSS&AMFS–BSA–ESN versus GA-ESN                  94   11   0.009     Yes
APSS&AMFS–BSA–ESN versus BSA–ESN                 49   56   0.8261    No
APSS&AMFS–BSA–ESN versus APSS–BSA–ESN            51   54   0.925     No
APSS&AMFS–BSA–ESN versus AMFS–BSA–ESN            63   42   0.5098    No

5.3. Comparative experiment 2: Sunspot time series

The sunspot series is a well-known time series that records the annual visible spots observed on the face of the sun since 1700. Generally, the period of 1700–1987 (288 data) is employed as a standard time series by many researchers (Zhang, 2003; Ghiassi and Saidane, 2005; Adhikari, 2015) to evaluate the performance of various forecasting methods. The standard sunspot series is plotted in Fig. 7. The training and testing data of the sunspot series are partitioned differently in existing studies (Zhang, 2003; Ghiassi and Saidane, 2005; Adhikari, 2015). Zhang (2003) used this time series to verify the effectiveness of his three proposed models (ANN, ARIMA, and hybrid), applying the data for 1700–1920 (221 data) for training and those for 1921–1955 (35 data) and 1921–1987 (67 data) for testing; the two evaluation metrics MSE and MAD were adopted to evaluate the forecasting accuracy of the three models. Ghiassi and Saidane (2005) followed the same partitioning of the data set to evaluate the effectiveness of their SPSS, DAN2–M1, and DAN2–M2 models, which also adopt MSE and MAD as the measurement metrics. Adhikari (2015) proposed a linear combination method that designates the data for 1700–1952 (253 data) as training data and those for 1953–1987 (35 data) as testing data, also adopting the two evaluation metrics MSE and MAE to evaluate forecasting performance. In our simulation, we consider three partitioning cases of the sunspot time series to make a fair comparison with the existing forecasting models. In the first two cases, the data for 1700–1920 (221 data) are designated as training data, and the next 35 data and 67 data are applied to test the trained models in case 1 and case 2, respectively. In the third case, the data for 1700–1952 (253 data) are designated as training data and the data for 1953–1987 (35 data) as testing data.
Meanwhile, we designate the final 35, 67, and 35 data entries in the training data sets in cases 1, 2, and 3, respectively, as validation data sets to determine the best of the 300 trial models for each forecasting model. The optimal modeling and training information for the seven proposed models are presented in Table 10. The results

absolute error indicates the absolute value of the error between the predicted value and the actual value; the smaller the absolute error, the better. Table 8 shows the absolute errors between the values produced by the seven proposed models and the actual values. The Friedman test outcomes indicate that significant differences exist among the compared methods because the 𝑝-value of 0.011 is smaller than the fiducial value of 0.05. Moreover, the Wilcoxon signed-rank test results shown in Table 9 indicate that the APSS&AMFS–BSA–ESN model is statistically better than the ESN and GA-ESN models, with p-values of 0.0157 and 0.009, and is not significantly better than the BSA-BPNN, BSA–ESN, APSS–BSA–ESN, and AMFS–BSA–ESN models, with p-values of 0.331, 0.8261, 0.925, and 0.5098, respectively. This indicates that the optimized ESN outperforms the basic ESN. (b) The forecasting performance of the APSS&AMFS–BSA–ESN model is better than those of the existing models. A comparison of the forecasting performance of the proposed APSS&AMFS–BSA–ESN model with those of other existing models, such as hybrid ARIMA/ANNs (Zhang, 2003), ANN/PNN (Khashei and Bijari, 2012), ARIMA/PNN (Khashei and Bijari, 2012), ADE–BPNN (Wang et al., 2015b), and a linear combination method (Adhikari, 2015), is shown in Table 7. The proposed APSS&AMFS–BSA–ESN model outperforms ARIMA/ANNs, ANN/PNN, ARIMA/PNN, ADE–BPNN, and the linear combination method, with MSE improvements of 67.35%, 62.16%, 50.90%, 45.85%, and 6.22%, respectively. This information is shown as ''Relative error (MSE)'' in the last column of Table 7.
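The R+ and R− rank sums reported in Table 9 can be reproduced from the absolute errors in Table 8. A minimal pure-Python sketch (assuming no tied absolute differences, so average ranks are not needed; the sign convention here, positive difference when the comparison method has the larger error, matches the first row of Table 9):

```python
def wilcoxon_rank_sums(errors_base, errors_comp):
    # Rank the absolute paired differences, then sum the ranks of the
    # positive differences (R+) and the negative differences (R-).
    diffs = [c - b for b, c in zip(errors_base, errors_comp)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    r_plus = r_minus = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            r_plus += rank
        else:
            r_minus += rank
    return r_plus, r_minus

# Absolute errors from Table 8: basic ESN versus APSS&AMFS-BSA-ESN.
esn = [0.1704, 0.2491, 0.334, 0.1163, 0.0658, 0.305, 0.0404,
       0.1929, 0.2481, 0.02, 0.0613, 0.0529, 0.0658, 0.1295]
apss_amfs = [0.0553, 0.0759, 0.1301, 0.0274, 0.0072, 0.0187, 0.0768,
             0.0962, 0.1465, 0.0192, 0.093, 0.0621, 0.1054, 0.0351]
```

Calling `wilcoxon_rank_sums(esn, apss_amfs)` returns (14, 91), the R+ and R− values in the first row of Table 9.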


Fig. 7. Sunspot series (1700–1987).

Table 10
Optimal modeling and training information for all the seven forecasting models.

Training 1700–1920, Case 1 (testing 1921–1955):
Models                 Training error (MSE)   Validation error (MSE)   Structure
ESN                    5.8810 × 10⁻¹⁵         754.8305                 10 × 500 × 1
BSA-BPNN               232.5563               267.5145                 4 × 20 × 1
GA-ESN                 320.1152               329.7561                 5 × 300 × 1
BSA–ESN                218.9468               251.7041                 1 × 300 × 1
APSS–BSA–ESN           315.8604               328.5683                 2 × 500 × 1
AMFS–BSA–ESN           257.8666               199.5155                 6 × 200 × 1
APSS&AMFS–BSA–ESN      207.3930               258.5981                 3 × 500 × 1

Training 1700–1920, Case 2 (testing 1921–1987):
ESN                    1.9089 × 10⁻¹⁵         1531.9433                10 × 500 × 1
BSA-BPNN               226.6352               241.9961                 2 × 30 × 1
GA-ESN                 245.2039               260.2233                 3 × 200 × 1
BSA–ESN                241.7963               236.9870                 2 × 200 × 1
APSS–BSA–ESN           229.0068               252.7344                 1 × 200 × 1
AMFS–BSA–ESN           204.8984               231.1842                 2 × 200 × 1
APSS&AMFS–BSA–ESN      206.3068               237.4302                 2 × 200 × 1

Training 1700–1952, Case 3 (testing 1953–1987):
ESN                    58.3016                1177.4098                1 × 500 × 1
BSA-BPNN               239.3256               279.2168                 5 × 30 × 1
GA-ESN                 251.7788               299.0254                 4 × 300 × 1
BSA–ESN                258.6495               290.6599                 2 × 200 × 1
APSS–BSA–ESN           233.3103               300.0323                 1 × 200 × 1
AMFS–BSA–ESN           227.7689               256.7258                 5 × 300 × 1
APSS&AMFS–BSA–ESN      233.1632               233.9064                 4 × 200 × 1

Table 11
Comparison of the performance of the proposed forecasting models with those of existing models.

                                        Training 1700–1920                      Training 1700–1952
Models                                  Case 1 (1921–1955)  Case 2 (1921–1987)  Case 3 (1953–1987)
                                        MSE (MAD)           MSE (MAD)           MSE (MAE)
Existing models
  Zhang's ARIMA                         217 (11.3)          306 (13.0)          N/A
  Zhang's ANN                           205 (10.2)          351 (13.5)          N/A
  Zhang's Hybrid                        187 (10.8)          280 (12.8)          N/A
  Ghiassi's SPSS                        167 (10.1)          359 (14.1)          N/A
  Ghiassi's DAN2–M1                     197 (10.2)          286 (11.9)          N/A
  Ghiassi's DAN2–M2                     146 (9.6)           266 (12.3)          N/A
  Adhikari's linear combination method  N/A                 N/A                 311 (13.49)
Proposed models
  ESN                                   448 (14.5)          1165 (21.8)         1826 (30.62)
  BSA-BPNN                              135 (7.2)           244 (8.0)           307 (12.95)
  GA-ESN                                180 (8.5)           256 (7.3)           322 (14.1)
  BSA–ESN                               107 (4.5)           231.4 (6.1)         320 (13.79)
  APSS–BSA–ESN                          150 (6.0)           258 (7.1)           369 (15.19)
  AMFS–BSA–ESN                          130 (6.6)           241 (5.9)           296 (12.74)
  APSS&AMFS–BSA–ESN                     119 (5.8)           214 (6.4)           286 (12.10)

of the seven proposed models are presented in Table 11 and depicted in Figs. 8, 9, and 10. The results of the existing forecasting models are also presented in Table 11. The best results are shown in boldface. The phenomenon observed in the Canadian lynx series is also seen in this series when we integrate the information from Tables 10 and 11. This information helps confirm the necessity of applying BSA (basic BSA and its variants) to optimize the output weights of ESN. Table 11 shows that in cases 1 and 2, the performance of the proposed APSS&AMFS–BSA–ESN model is superior to those of Zhang’s ARIMA,

ANN, and hybrid models; Ghiassi's SPSS, DAN2–M1, and DAN2–M2 models; and the proposed ESN, APSS–BSA–ESN, AMFS–BSA–ESN, BSA–BPNN, and GA–ESN models. However, when APSS&AMFS–BSA–ESN is compared with the proposed BSA–ESN model, the former performs relatively unsatisfactorily in case 1 but performs well in case 2. The reason for this unsatisfactory result is multifaceted; one reason is that this model may not always be appropriate for all data sets. In case 3, the proposed APSS&AMFS–BSA–ESN model outperforms all the other proposed models and Adhikari's linear combination method.


Fig. 8. Actual values and forecasting results of the seven proposed models of the sunspot series (case 1).

Fig. 9. Actual values and forecasting results of the seven proposed models of the sunspot series (case 2).

6. Conclusions and future work

The current study presents a novel approach for solving the overfitting problem of the basic ESN in order to improve its performance. The three BSA variants APSS–BSA, AMFS–BSA, and APSS&AMFS–BSA are designed to avoid overfitting. Unlike the basic ESN, which applies a conventional linear regression method to compute the output weights, the proposed approaches utilize BSA (basic BSA or its variants) for optimization, thereby avoiding overfitting of the trained ESN during training. Two widely used benchmark time series are then employed to verify the feasibility and effectiveness of the optimized BSA-based ESN models (BSA–ESN, APSS–BSA–ESN, AMFS–BSA–ESN, and APSS&AMFS–BSA–ESN), the basic ESN model, and the two comparison models (BSA–BPNN and GA–ESN).

Experimental results indicate that (a) the five optimized ESN models exhibit better forecasting accuracies than the basic ESN model and (b) the APSS&AMFS–BSA–ESN model is superior to the six other proposed models and the other existing models. Although the optimized ESN models exhibit better performance than the existing models, they are relatively time consuming because of the execution of BSA or its variants. In the future, we may investigate more effective intelligent algorithms (Wang et al., 2019) to optimize the output weights of ESN, given that computing the output weights of ESN with the conventional linear regression method leads to overfitting of the trained network. Meanwhile, we may apply EAs to optimize ESN architectures that are more complex than the one adopted in the current study. The improved ESN can then be used for effective forecasting problems in various fields (Liu et al., 2018b; Zhou et al., 2018, 2019).

Acknowledgments

The authors are very grateful for the constructive comments of the editors and referees. This research is partially supported by the National Natural Science Foundation of China (Grant numbers: 71771095; 71531009)


Fig. 10. Actual values and forecasting results of the seven proposed models of the sunspot series (case 3).

and the Humanities and Social Sciences Foundation of the Chinese Ministry of Education, China (No. 18YJA630005).

Conflict of interest

The authors declare that they have no conflict of interest.

References

Adhikari, R., 2015. A neural network based linear ensemble framework for time series forecasting. Neurocomputing 157, 231–242.
Andrawis, R.R., Atiya, A.F., El-Shishiny, H., 2011. Forecast combinations of computational intelligence and linear models for the NN5 time series forecasting competition. Int. J. Forecast. 27 (3), 672–688.
Bianchi, F.M., Scardapane, S., Uncini, A., Rizzi, A., Sadeghian, A., 2015. Prediction of telephone calls load using Echo State Network with exogenous variables. Neural Netw. 71, 204–213.
Campbell, M.J., Walker, A.M., 1977. A survey of statistical work on the Mackenzie River series of annual Canadian lynx trappings for the years 1821–1934 and a new analysis. J. R. Stat. Soc. (Ser. A) 140 (4), 411–431.
Chaib, A.E., Bouchekara, H.R.E.H., Mehasni, R., Abido, M.A., 2016. Optimal power flow with emission and non-smooth cost functions using backtracking search optimization algorithm. Int. J. Electr. Power Energy Syst. 81, 64–77.
Chandra, R., Zhang, M.J., 2012. Cooperative coevolution of Elman recurrent neural networks for chaotic time series prediction. Neurocomputing 86, 116–123.
Chatzis, S.P., Demiris, Y., 2012. The copula echo state network. Pattern Recognit. 45 (1), 570–577.
Cherif, A., Cardot, H., Boné, R., 2011. SOM time series clustering and prediction with recurrent neural networks. Neurocomputing 74 (11), 1936–1944.
Chouikhi, N., Ammar, B., Rokbani, N., Alimi, A.M., 2017. PSO-based analysis of echo state network parameters for time series forecasting. Appl. Soft Comput. 55, 211–225.
Civicioglu, P., 2013. Backtracking search optimization algorithm for numerical optimization problems. Appl. Math. Comput. 219 (15), 8121–8144.
Cottrell, M., Girard, B., Girard, Y., Mangeas, M., Muller, C., 1995. Neural modeling for time series: a statistical stepwise method for weight elimination. IEEE Trans. Neural Netw. 6 (6), 1355–1364.
Ghiassi, M., Saidane, H., 2005. A dynamic architecture for artificial neural network. Neurocomputing 63, 397–413.
Hornik, K., Stinchcombe, M., White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2 (5), 359–366.
Huang, B.B., Qin, G.H., Zhao, R., Wu, Q., Shahriari, A., 2018. Recursive Bayesian echo state network with an adaptive inflation factor for temperature prediction. Neural Comput. Appl. 29 (12), 1535–1543.
Jaeger, H., 2001. The ''echo state'' approach to analysing and training recurrent neural networks. GMD Report 148, GMD–German National Research Institute for Computer Science.
Jaeger, H., Haas, H., 2004. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304 (5667), 78–80.
Khashei, M., Bijari, M., 2012. A new class of hybrid models for time series forecasting. Expert Syst. Appl. 39 (4), 4344–4357.
Kobialka, H.U., Kayani, U., 2010. Echo state networks with sparse output connections. In: Proceedings of 2010 International Conference on Artificial Neural Networks, pp. 356–361.
Lacy, S.E., Smith, S.L., Lones, M.A., 2018. Using echo state networks for classification: A case study in Parkinson's disease diagnosis. Artif. Intell. Med. 86, 53–59.
Li, G., Song, H., 2008. Tourism demand modelling and forecasting-A review of recent research. Tourism Manag. 29 (2), 203–220.
Liu, L., Wang, Z., Yao, X., Zhang, H., 2018a. Echo state networks based data-driven adaptive fault tolerant control with its application to electromechanical system. IEEE-ASME Trans. Mechatronics 23 (3), 1372–1382.
Liu, R., Zeng, Y.R., Qu, H., Wang, L., 2018b. Optimizing the new coordinated replenishment and delivery model considering quantity discount and resource constraints. Comput. Ind. Eng. 116, 82–96.
Lukoševičius, M., Jaeger, H., 2009. Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3 (3), 127–149.
Lv, S.X., Peng, L., Wang, L., 2018a. Stacked autoencoder with echo-state regression for tourism demand forecasting using search query data. Appl. Soft Comput. 73, 119–133.
Lv, S.X., Zeng, Y.R., Wang, L., 2018b. An effective fruit fly optimization algorithm with hybrid information exchange and its applications. Int. J. Mach. Learn. Cybern. 9 (10), 1623–1648.
Madasu, S.D., Kumar, M.L.S.S., Singh, A.K., 2017. Comparable investigation of backtracking search algorithm in automatic generation control for two area reheat interconnected thermal power system. Appl. Soft Comput. 55, 197–210.
Mirjalili, S., Hashim, S.Z.M., Sardroudi, H.M., 2012. Training feedforward neural networks using hybrid particle swarm optimization and gravitational search algorithm. Appl. Math. Comput. 218 (22), 11125–11137.
Pan, Q.K., Sang, H.Y., Duan, J.H., Gao, L., 2014. An improved fruit fly optimization algorithm for continuous function optimization problems. Knowl.-Based Syst. 62, 69–83.
Peng, H.W., Wu, S.F., Wei, C.C., Lee, S.J., 2015. Time series forecasting with a neuro-fuzzy modeling scheme. Appl. Soft Comput. 32, 481–493.
Peng, L., Liu, S., Liu, R., Wang, L., 2018. Effective long short-term memory with differential evolution algorithm for electricity price prediction. Energy 162, 1301–1314.
Rubio, A., Bermúdez, J.D., Vercher, E., 2017. Improving stock index forecasts by using a new weighted fuzzy-trend time series method. Expert Syst. Appl. 76, 12–20.
Crone, S.F., Hibon, M., Nikolopoulos, K., 2011. Advances in forecasting with neural networks? empirical evidence from the NN3 competition on time series prediction. Int. J. Forecast. 27 (3), 635–660. Dutoit, X., Schrauwen, B., Van Campenhout, J., Stroobandt, D., Van Brussel, H., Nuttin, M., 2009. Pruning and regularization in reservoir computing. Neurocomputing 72 (7–9), 1534–1546. Esling, P., Agon, C., 2012. Time-series data mining. ACM Comput. Surv. 45 (1), 1–34. Ewing, B.T., Thompson, M.A., 2012. Forecasting wind speed with recurrent neural networks. European J. Oper. Res. 221 (1), 148–154. Fildes, R., Kourentzes, N., 2011. Validation and forecasting accuracy in models of climate change. Int. J. Forecast. 27 (4), 968–995. 131

Z.G. Wang, Y.-R. Zeng, S.R. Wang et al.

Engineering Applications of Artificial Intelligence 81 (2019) 117–132 Wang, L., Zeng, Y., Chen, T., 2015b. Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Syst. Appl. 42, 855–863. Xu, M., Han, M., Qiu, T., Lin, H., 2018. Hybrid regularized echo state network for multivariate chaotic time series prediction. IEEE Trans. Cybern. PP (99), 1–11. http://dx.doi.org/10.1109/TCYB.2018.2825253. Yang, C., Qiao, J., Han, H., Wang, L., 2018. Design of polynomial echo state networks for time series prediction. Neurocomputing 290, 148–160. Zeng, Y., Zeng, Y.R., Choi, B., Wang, L., 2017. Multifactor-influenced energy consumption forecasting using enhanced back-propagation neural network. Energy 127, 381–396. Zhang, G.P., 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, 159–175. Zhang, C., Liu, Q., Gao, L., Li, X., 2015. Backtracking search algorithm with three constraint handling methods for constrained optimization problems. Expert Syst. Appl. 42 (21), 7831–7845. Zhang, G.P., Patuwo, B.E., Hu, M.Y., 1998. Forecasting with artificial neural networks: The state of the art. Int. J. Forecast. 14 (1), 35–62. Zhou, Q., Benlic, U., Wu, Q.H., Hao, J.K., 2019. Heuristic search to the capacitated clustering problem. Eur. J. Oper. Res. 273 (2), 464–487. Zhou, H., Wang, H., Zeng, W., 2018. Smart construction site in mega construction projects: A case study on island tunneling project of Hong Kong-Zhuhai-Macao Bridge. Front. Eng. Manag. 5 (1), 78–87. Zou, F., Chen, D., Li, S., Lu, R., Lin, M., 2017. Community detection in complex networks: Multi-objective discrete backtracking search optimization algorithm with decomposition. Appl. Soft Comput. 53, 285–295.

Skowronski, M.D., Harris, J.G., 2007. Automatic speech recognition using a predictive echo state network classifier. Neural Netw. 20, 414–423. Tak-chung, F., 2011. A review on time series data mining. Eng. Appl. Artif. Intell. 24 (1), 164–181. Tong, M.H., Bickett, A.D., Christiansen, E.M., Cottrell, G.W., 2007. Learning grammatical structure with echo state networks. Neural Netw. 20 (3), 424–432. Wang, S., Da, X., Li, M., Han, T., 2016a. Adaptive backtracking search optimization algorithm with pattern search for numerical optimization. J. Syst. Eng. Electron. 27 (2), 395–406. Wang, L., Hu, H., Ai, X.Y., Liu, H., 2018a. Effective electricity energy consumption forecasting using echo state network improved by differential evolution algorithm. Energy 153, 801–815. Wang, L., Hu, H., Liu, R., Zhou, X., 2019. An improved differential harmony search algorithm for function optimization problems. Soft Comput, online http://dx.doi. org/10.1007/s00500-018-3139-4. Wang, L., Lv, S.X., Zeng, Y.R., 2018b. Effective sparse adaboost method with ESN and FOA for industrial electricity consumption forecasting in China. Energy 155 (2018), 1013–1031. Wang, L., Shi, Y., Liu, S., 2015a. An improved fruit fly optimization algorithm and its application to joint replenishment problems. Expert Syst. Appl. 42 (9), 4310–4323. Wang, L., Wang, Z., Liu, S., 2016b. An effective multivariate time series classification approach using echo state network and adaptive differential evolution algorithm. Expert Syst. Appl. 43, 237–249. Wang, L., Wang, Z., Qu, H., Liu, S., 2018. Optimal forecast combination based on neural networks for time series forecasting. Appl. Soft Comput. 66, 1–17. Wang, H., Yan, X., 2015. Optimizing the echo state network with a binary particle swarm optimization algorithm. Knowl.-Based Syst. 86, 182–193.

132