
Computational Economics, https://doi.org/10.1007/s10614-019-09960-5

Machine learning with parallel neural networks for analyzing and forecasting electricity demand

Yi-Ting Chen1 · Edward W. Sun2 · Yi-Bing Lin1

Accepted: 6 December 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
Traditional methods applied in electricity demand forecasting have been challenged by the curse of dimensionality that arises as a growing number of distributed or decentralized energy systems are deployed. Without manually operated data preprocessing, classic models are not well calibrated for robustness when dealing with disruptive elements (e.g., demand changes on holidays and in extreme weather). Based on the application of big-data-driven analytics, we propose a novel machine learning method originating from parallel neural networks for robust monitoring and forecasting of power demand, to enhance supervisory control and data acquisition for new industrial tendencies such as Industry 4.0 and the Energy IoT. Through our approach, we generalize the implementation of machine learning by parallelizing classic feed-forward neural networks so that the proposed method achieves superior performance when dealing with high dimensionality and disruptiveness. With high-frequency data of consumption in Australia from January 2009 to December 2015, the overall empirical results confirm that our proposed method performs significantly better for dynamic monitoring and forecasting of power demand compared with the classic methods.

Keywords Big data · Energy · Forecasting · Machine learning · Neural networks (PNNs)

JEL Classification C02 · C10 · C63

Corresponding author: Edward W. Sun ([email protected])

1 College of Computer Science, National Chiao Tung University, Hsinchu, Taiwan

2 KEDGE Business School, 680 Cours de la Libération, 33405 Talence Cedex, France


1 Introduction

Electricity demand (utilization or load¹) forecasting is essential for effective power system operation and attracts major attention from the restructured electricity market. Ghadimi et al. (2018) point out that electricity load is non-linear with a high level of volatility, and predicting such complex signals requires suitable prediction tools. The volatility exhibits strong seasonality in different time frames (i.e., intra-day, intra-week, intra-month, and intra-year seasonality). Meanwhile, anomalies constantly occur in power consumption. Many days (e.g., public holidays or extreme weather days) exhibit load profiles that differ noticeably from the repeating patterns observed on normal days. Anomalous utilization deviates significantly from normal load and is therefore not straightforward to model. For example, Abedini et al. (2019) pointed out that the participation of large consumers in the market is a challenging issue due to their considerable load demand. Some special operations also lead to anomalous loads, see Saeedi et al. (2019) for example. The relatively infrequent occurrence of anomalies results in a lack of observations for adequately training the model. Different anomalies exhibit different load profiles, each of which must be modeled uniquely. These factors make statistical modeling of anomalous load very challenging, see Siddharth and Taylor (2018) and references therein.

In this article, we implement a novel framework for big data collection, analysis, and management procedures to facilitate robust monitoring, forecasting, and economic assessment within the scope of the Energy IoT. This framework relies on the architecture of an embedded system built on parallelizing neural networks (PNNs), a task-oriented group that parallelizes several individual neural networks to create a new, more complex system that delivers more functionality and superior performance than simply the sum of many single neural networks. In this framework, the decision-making process based on forecasting the short-term demand rests on an optimization that is generally implemented with mathematical programming of genetic algorithms coupled with neural networks. In this article, the decision-making algorithm applied in our framework is based on a simple decision rule to minimize the forecasting error.

Neural network (NN) methods have the property of universal approximation and can achieve high predictive accuracy; therefore, they have received a lot of attention in decision-support systems. Although considered a singular approach,² NN methods lose some of their advantage in interpretability in comparison to parametric models; nevertheless, explanatory rules that extract the patterns and capture the learned knowledge embedded in the networks can clarify the decision making of the NN and help illustrate why a specific implementation is classified as either bad or good.

¹ The end use of the generation, transmission, and distribution of electric power. Load requires a specific amount of energy over a predetermined period of time. The total load of the power system is the sum of the total power consumed by all the electrical equipment in the system.
² The non-identifiability of the NN model is similar to that of other methods, such as the wavelet method, Bayesian networks, the hidden Markov model, and topological data mining (see Sun et al. 2015, and references therein). The model is referred to as being singular because its Fisher information matrix is unusual, and knowledge to be learned and patterns to be discovered from the sample space correspond to an individual instance; hence, difficulties arise when developing a mathematical tool that enables us to understand statistical estimation and interpret learning processes.


Tam and Kiang (1992) introduce an approach that can perform discriminant analysis in business research with neural networks. Compared with a linear classifier, logistic regression, kNN, and an ID3 algorithm, they show that the NN approach is a promising method in terms of predictive accuracy, adaptability, and robustness. Wang (1995) suggests a possible way of improving the performance of the NN approach in managerial applications by offering an extension of Tam and Kiang (1992). Hill et al. (1996) compare six statistical time-series models with the NN approach for forecasting and report that, across monthly and quarterly time series, the neural networks perform significantly better than these traditional models. In order to improve the classification accuracy and reduce the computational time of neural networks, Piramuthu et al. (1998) develop a methodology of feature construction to efficiently transform the training data. Baesens et al. (2003) use neural networks for rule extraction in credit-risk evaluation and conclude that the neural network method is powerful as a management tool for decision-support systems. Kim et al. (2005) propose a new approach of neural networks with a genetic algorithm in principal component analysis (PCA) and indicate that the underlying method is more accurate and parsimonious than traditional PCA methods. Wong et al. (2010) propose an adaptive neural network (ADNN) with adaptive metrics of inputs for time-series prediction, and Kiani (2011) employed artificial neural networks (ANNs) to predict fluctuations in economic activity in several members of the Commonwealth of Independent States (CIS) using macroeconomic time series. Venkatesh et al. (2014) perform prediction of cash demand for groups of ATMs with different neural networks. Etemadi et al. (2015) examined earnings per share forecasting using a multi-layer perceptron (MLP) neural network and determined an optimal model based on evaluating the forecasting accuracy. Stasinakis et al. (2016) investigated the efficiency of radial basis function neural networks in forecasting US unemployment and reported that the neural-network-based method statistically outperforms all competing models. Katris (2019) explored and analyzed time series with feed-forward neural networks for prediction of unemployment. Levendovszky et al. (2019) used a feed-forward neural network to obtain an efficient estimation of the forward conditional probability distribution in electronic trading. Ramyar and Kianfar (2019) investigate the predictability of oil prices using a multilayer perceptron neural network that considers the exhaustible nature of crude oil and the impact of monetary policy along with other major drivers of crude oil prices. Gao et al. (2019) proposed a forecast engine comprised of a multi-block neural network (NN) optimized by an intelligent algorithm to improve the training mechanism and forecasting ability. Therefore, our study enriches the NN approaches currently applied in management science.

Parallelization applies naturally to time-consuming algorithms to improve efficiency. Swann (2002) takes advantage of parallel computing to solve a maximum likelihood problem using the MPI message passing library. Creel (2008) parallelizes the parameterized expectations algorithm (PEA) to reduce the time needed to solve a simple model by more than 80%. Cai et al. (2015) implement parallel dynamic programming methods in the HTCondor Master-Worker system to solve demanding high-dimensional dynamic programming problems and report that computational productivity can be increased by at least two orders of magnitude. Muresano and Pagano (2016) apply a multi-core architecture in order to improve the execution time while considering different key points (e.g., core communications, data locality, dependencies, and memory size). Creel (2016) illustrates the possibility of using the programming language Julia with the message passing interface (MPI) for parallel computing on multicore computers or clusters of computers.

Dynamic measures for forecasting and monitoring update the information available at the time of evaluation based on error occurrence; that is, the assessment adapts to some filtered probability space. In a dynamic setting, error measures for processes can be identified with error measures for random variables in an appropriate product space, see Sun et al. (2015). The interrelation of error assessments at different times can be characterized by the property of time consistency, which can be described either by the super-martingale property of the error process or by the principle of prudence from the viewpoint of a decision maker, and requires that if the error is tolerable for any scenario tomorrow then it should be tolerable today. In order to assess the performance of our system, we propose an efficient measure in our algorithm based on the tail error, an extreme error outside an ordinary distribution of outcomes (or simply, higher than expected) that causes massive inaccuracy. It has come to signify any big downward movement in a system's accuracy. There are different ways to adjust the system to minimize the tail error, but a popular one is to optimize the algorithm with respect to its entropy, that is, the trade-off between complexity and accuracy. In our algorithm, we conduct conventional convex optimization, and both a simulation (with stylized data) and empirical results with real big data confirm the superior performance of our proposed method in comparison to alternative models.

In this paper, we first discuss the properties of the newly proposed parallel neural networks (PNNs) and run a simulation comparing them with three alternative methods, i.e., the linear seasonal dummy model (LSDM), the autoregressive moving average (ARMA) model, and the ARMA model with generalized autoregressive conditional heteroskedasticity (GARCH) errors, to highlight their performance in terms of error control with three conventional criteria for error measurement, i.e., the root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). We then conduct an empirical study with the proposed PNN for electricity demand forecasting based on data from Australia. Based on the feature engineering method proposed by Chen et al. (2018), in our empirical work we recognize four features: (1) hourly demand on a weekday, (2) hourly demand on a weekend, (3) hourly demand in a day chosen from a period of four weeks around that day, and (4) hourly demand in a day chosen from a period of four weeks before that day. We run the PNNs simultaneously on the four features to train the parameters for each method and make continuous forecasts with a moving window. We compare the performance of the PNNs with the LSDM, ARMA, and ARMA-GARCH models and show that the proposed PNN method is superior, both in the simulation and in the empirical study, in terms of error minimization.

We organize this article as follows. In Sect. 2, we describe the proposed methodology in detail, based on the parallel neural networks (PNNs), for power demand monitoring and forecasting in order to deal with anomalies efficiently. In Sect. 3, we conduct a simulation study with stylized data. In Sect. 4, we present an empirical study that applies our method to modeling and forecasting power demand with real big data from Australia. We conclude in Sect. 5.


2 The novel methodology

In this section, we describe the proposed parallel neural networks (PNNs) approach and analytically show its computational properties.

2.1 Simple neural networks

A neural network (NN) is a mathematical or computational model based on biological NNs. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases, an NN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. Similar to linear and polynomial approximation methods, an NN relates input variables x_i, i = 1, . . . , k, to output variables y_j, j = 1, . . . , k, which is essentially a mathematical model defining a function f : X → Y. Each type of NN model corresponds to a class of such functions. The difference between an NN model and other approximation methods is that an NN takes advantage of one or more hidden layers, in which the input variables are transformed by a special function known as a logistic or log-sigmoid transformation; that is, the function f(x) is a composition of other functions g_i(x) that can themselves be further defined as compositions of other functions. The functions f(x) and g_i(x) are composed of a set of elementary computational units called neurons,³ which are connected through weighted connections. These units are organized in layers so that every neuron in a layer is exclusively connected to the neurons of the preceding layer and the subsequent layer. Every neuron represents an autonomous computational unit and receives inputs as a series of signals that dictate its activation. All the input signals reach the neuron simultaneously, and a neuron can receive more than one input signal. Following the activation dictated by the input signals, the neurons produce the output signals. Every input signal is associated with a connection weight that determines the relative importance of the input signals in generating the final impulse transmitted by the neuron.

Formally, the described algorithm can be expressed as follows:

n_{k,t} = w_{k,0} + \sum_{i=1}^{i^*} w_{k,i} x_{i,t},  (1)

N_{k,t} = G(n_{k,t}),  (2)

y_t = \gamma_0 + \sum_{k=1}^{k^*} \gamma_k N_{k,t},  (3)

where G(·) represents the activation function and N_{k,t} stands for the neurons. In this system, there are i* input variables x and k* neurons. A linear combination of these input variables observed at time t, that is, x_{i,t}, i = 1, . . . , i*, with the coefficient vector (i.e., a set of input weights) w_{k,i}, i = 1, . . . , i*, and a constant term w_{k,0} form the variable n_{k,t}. This variable n_{k,t} is transformed by the activation function G(·) to a neuron N_{k,t} at time (or observation) t. The set of k* neurons at time (or observation) index t are combined in a linear way with the coefficient vector γ_k, k = 1, . . . , k*, and taken with a constant term γ_0 to form the output value y_t at time t. In defining an NN model, the activation function G(·) is typically one of the elements to specify. There are three commonly employed types of activation functions: linear, stepwise, and sigmoid.

³ Sometimes such elementary computational units are called nodes, neuron nodes, units, or processing elements.
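For illustration only (this sketch is not part of the original study), the feed-forward map of Eqs. (1)–(3) can be written in Python/NumPy as follows; the function and variable names are ours, and the logistic function merely stands in for G(·).

```python
import numpy as np

def logistic(t):
    # one of the three common activation choices (sigmoid type)
    return 1.0 / (1.0 + np.exp(-t))

def feed_forward(x_t, w0, W, gamma0, gamma, G=logistic):
    """Evaluate Eqs. (1)-(3) for a single observation.

    x_t    : input vector of length i* (the input variables x_{i,t})
    w0     : length-k* vector of constants w_{k,0}
    W      : (k*, i*) matrix of input weights w_{k,i}
    gamma0 : constant term gamma_0
    gamma  : length-k* vector of output weights gamma_k
    """
    n_t = w0 + W @ x_t           # Eq. (1): linear combinations n_{k,t}
    N_t = G(n_t)                 # Eq. (2): neurons N_{k,t}
    return gamma0 + gamma @ N_t  # Eq. (3): output y_t
```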

Neurons of the NN model are organized in layers. There are three types of layers: input, output, and hidden. The input layer receives information only from external sources, that is, an explanatory variable x_i. No calculation is performed in the input layer; it only transmits information to the next level. The output layer produces the final results sent by the network to the outside of the system, that is, the response variable y_j. Between the input and output layers there can be one or more intermediate layers, called hidden layers, so named because these layers are not directly connected with external information. Giudici points out that the architecture of an NN model refers to the network's organization: (1) the number of layers, (2) the number of neurons belonging to each layer, (3) the manner in which the neurons are connected, and (4) the direction of flow for the computation.

Different information flows lead to different types of networks. The NN model can be divided into two types based on the information flow: feed-forward networks and recurrent networks. In a feed-forward network, the information moves in only one direction: from the input layer through the hidden layer and to the output layer. There are no cycles or loops in this type of network. Equations (1)–(3) describe feed-forward networks. In contrast to feed-forward networks, recurrent networks are models with bi-directional information flows that enable the neurons to depend not only on the input variable x_i but also on their own lagged values n_{k,t−p} at order p. The recurrent network illustrates dependence in the evolution of the neurons. Replacing Eq. (1) with the following Eq. (4), a recurrent network can be formed by

n_{k,t} = w_{k,0} + \sum_{i=1}^{i^*} w_{k,i} x_{i,t} + \sum_{k=1}^{k^*} \phi_k n_{k,t-p},  (4)

where φ_k stands for the weight of the lagged value n_{k,t−p} at order p. The recurrent network has an indirect feedback effect from the lagged unsquashed neurons to the current neurons, not a direct feedback from lagged neurons to the level of output. In our study, we focus on the feed-forward network.

An NN model modifies its interconnection weights by applying a set of learning (training) samples. The learning process leads to parameters of a network that implicitly represent the knowledge stored from the data. More generally, given a specific task to solve and a class of functions F, learning means using a set of observations (learning/training samples) in order to find f* ∈ F which solves the task in an optimal sense. This entails defining a cost function C : F → \mathbb{R} such that, for the optimal solution f*, C(f*) ≤ C(f) ∀ f ∈ F; that is, no solution has a cost less than that of the optimal solution. Figure 1 shows a simple NN, that is, a three-layer fully connected NN with M neurons. It comprises an input layer, one hidden layer, and an output layer. The hidden layer is made up of a number of neurons u_m (m = 1, . . . , M), which act as feature detectors. As the learning progresses, the hidden neurons gradually begin to discover the salient relationship between the historical uses given as input and the forecasted use.


Fig. 1 A basic neural network model

Fig. 2 A neuron

Figure 2 demonstrates how a neuron works. At the beginning of the training process, each neuron connects to the input vector (i.e., historical similar power usages x_i in our example) with initial weights and performs an inner product. Inner products are commonly used to quantify the similarity of two vectors. Because similarity ranges vary, the activation function A(·) is introduced to rescale the range and make similarities easy to compare. In our case, we choose the sigmoid function as the activation function because its range is limited to [0, 1] and it is differentiable and non-decreasing. Each neuron can be expressed as in Eq. (5), where a_m is the output of a neuron u_m, β_m denotes the inner product result, w_{m,i} are weights, and A(·) is the sigmoid function.

a_m = A(\beta_m) = A\!\left(\sum_{d=1}^{D} w_{m,i}\, x_{i,d}\right)  (5)

In the output layer, all weights (i.e., w_{1,1}, . . . , w_{1,M}) between the hidden layer and the output z_i evaluate each neuron's importance. The output layer reports the result of the inner product between all hidden neurons' outputs a_1, . . . , a_M and the corresponding weights, and applies an activation function to derive the output z̃. Therefore, the overall network function can be formally expressed as Eq. (6).

\tilde{z}(x_i, \mathbf{w}) = A\!\left(\sum_{m=1}^{M} w_{1,m}\, a_m\right) = A\!\left(\sum_{m=1}^{M} w_{1,m}\, A\!\left(\sum_{d=1}^{D} w_{m,i}\, x_{i,d}\right)\right)  (6)
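As a rough illustration of Eqs. (5)–(6), the sketch below (Python/NumPy; not from the paper) computes the hidden activations a_m and the overall output z̃ for one input vector. The hidden weights are stored as an M×D matrix, a simplification of the paper's subscripting, and all names are ours.

```python
import numpy as np

def sigmoid(t):
    # activation A(.): range (0, 1), differentiable and non-decreasing
    return 1.0 / (1.0 + np.exp(-t))

def network_output(x_i, W_hidden, w_out):
    """x_i: length-D input vector; W_hidden: (M, D) hidden weights;
    w_out: length-M output weights w_{1,m}. Returns (z~, hidden outputs a_m)."""
    beta = W_hidden @ x_i            # inner products beta_m
    a = sigmoid(beta)                # Eq. (5): hidden outputs a_m
    z_tilde = sigmoid(w_out @ a)     # Eq. (6): overall network function
    return z_tilde, a
```

Keeping the hidden activations a_m is convenient because the backward error propagation of Sect. 2.2.1 reuses them.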


Fig. 3 A parallel neural network

2.2 Parallel neural networks

Figure 3 shows a parallel neural network (PNN) that consists of two parts. The first part is a number of simple three-layer NNs, NN_q (q = 1, . . . , n). Each NN tries to capture the most salient features with a distinct and predefined number of neurons. The second part is the evaluation module, which determines the optimal weight w* of the neural network. Training basically consists of determining the weights that achieve the desired target values by minimizing the sum of the squared estimated errors E(x, w) over all training samples, where

E(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{N} e(x_i, \mathbf{w}).  (7)

In Eq. (7), the function e(x_i, w) is the estimated error of the training sample x_i. To keep the expression simple, we omit the corresponding input vector and express E(x, w) as E(w). The per-sample error used in training is given in Eq. (8):

e(x_i, \mathbf{w}) = \frac{1}{2}\,\bigl[z_i - \tilde{z}(x_i, \mathbf{w})\bigr]^2.  (8)

To keep the load balanced, we partition the optimization work equally across processors. At the beginning of the training process, each NN_q is assigned a distinct and predefined number of neurons and initialized with initial weights w_q.
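A minimal sketch of the training objective in Eqs. (7)–(8) (Python/NumPy; illustrative only, with our own function names):

```python
import numpy as np

def sample_error(z_i, z_tilde_i):
    # e(x_i, w) of Eq. (8): half the squared deviation for one training sample
    return 0.5 * (z_i - z_tilde_i) ** 2

def total_error(z, z_tilde):
    # E(w) of Eq. (7): sum of the per-sample errors over all N training samples
    z, z_tilde = np.asarray(z, float), np.asarray(z_tilde, float)
    return float(np.sum(0.5 * (z - z_tilde) ** 2))
```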


2.2.1 Feed-forward computation and backward error propagation

The input is sent to each NN simultaneously for training, and each NN_q tries to find its optimal weight vector such that E(·) attains its global minimum. As E(·) is a continuous function, its global minimum can be reached when ∇E(·) = 0. Therefore, we follow conventional NN techniques to train each NN by iteratively applying batch algorithms, which are made up of feed-forward computations and backward error propagation. In the feed-forward computation phase, the information flows from the input layer to the output layer by applying the given w_q to all the input vectors. We obtain the estimate z̃_i corresponding to each input vector x_i according to Eq. (6), together with the estimated error e(x_i, w_q).

In the backward error propagation phase, we need to compute the gradient of the error function with respect to each weight in the network, which tells us how a small change in that weight will affect the overall error E(·). Applying the chain rule yields the gradient of the error function with respect to all weights between the hidden layer and the output layer, as expressed in Eq. (9).

\frac{\partial e}{\partial w_{1,m}} = -(z - \tilde{z})\left(\frac{\partial \tilde{z}}{\partial w_{1,m}}\right) = -(z - \tilde{z})\left[\frac{\partial A(\beta_1)}{\partial w_{1,m}}\right] = -(z - \tilde{z})\,A'(\beta_1)\left(\frac{\partial \beta_1}{\partial w_{1,m}}\right) = -(z - \tilde{z})\,A'(\beta_1)\, a_m.  (9)

We also need to adjust all weights connecting the input nodes and the hidden neurons. To obtain the weight update vector between the input layer and the hidden layer, we keep propagating the errors to the previous layer, as follows:

\frac{\partial e}{\partial w_{m,i}} = -(z - \tilde{z})\left(\frac{\partial \tilde{z}}{\partial w_{m,i}}\right) = -(z - \tilde{z})\left[\frac{\partial A(\beta_1)}{\partial w_{m,i}}\right] = -(z - \tilde{z})\,A'(\beta_1)\left(\frac{\partial \beta_1}{\partial w_{m,i}}\right)

= -(z - \tilde{z})\,A'(\beta_1)\,\frac{\partial}{\partial w_{m,i}}\left[\sum_{m=1}^{M} w_{1,m} A(\beta_m)\right]

= -(z - \tilde{z})\,A'(\beta_1)\, w_{1,m}\left[\frac{\partial A(\beta_m)}{\partial w_{m,i}}\right]

= -(z - \tilde{z})\,A'(\beta_1)\, w_{1,m} A'(\beta_m)\left(\frac{\partial \beta_m}{\partial w_{m,i}}\right)

= -(z - \tilde{z})\,A'(\beta_1)\, w_{1,m} A'(\beta_m)\, x_i  (10)

In the weight adjustment phase, we resort to a gradient descent approach. The weight adjustment Δw_q is derived from propagating the error backward, see Eqs. (9)–(10). When the stopping criterion is satisfied (i.e., Δw_q = 0), the weight w_q and the corresponding error E(w_q) are sent to the model evaluation module for comparing the actual performance, and the training process of NN_q is terminated. Otherwise, the weight w_q is updated as follows:

w_q = w_q + \Delta w_q  (11)

and the training process of NN_q proceeds to the next epoch.
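The following sketch (Python/NumPy; our own names, not the paper's code) applies Eqs. (9)–(11) to a single training sample with the logistic activation, whose derivative is A'(β) = A(β)(1 − A(β)); the paper's batch algorithm accumulates these gradients over all samples before updating.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(x_i, z_i, W_hidden, w_out, lr=0.1):
    """One per-sample gradient step. Shapes: x_i (D,), W_hidden (M, D), w_out (M,).
    lr is an assumed learning rate; Delta w = -lr * gradient."""
    # feed-forward phase (Eqs. (5)-(6))
    a = sigmoid(W_hidden @ x_i)               # hidden outputs a_m
    z_tilde = sigmoid(w_out @ a)              # estimate z~

    # backward error propagation
    dA1 = z_tilde * (1.0 - z_tilde)                                   # A'(beta_1)
    g_out = -(z_i - z_tilde) * dA1 * a                                # Eq. (9)
    g_hid = np.outer(-(z_i - z_tilde) * dA1 * w_out * a * (1.0 - a), x_i)  # Eq. (10)

    # weight adjustment, Eq. (11)
    return W_hidden - lr * g_hid, w_out - lr * g_out
```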

2.2.2 Model evaluation

In order to avoid over-fitting, we design a regularization approach, which adds a penalty term to the error function E(w) to discourage the coefficients from reaching large values, as shown in the following Eq. (12):

E(\mathbf{w}) = \sum_{i=1}^{N} e(x_i, \mathbf{w}) + \alpha \|\mathbf{w}\|_1  (12)

where ‖w‖₁ = Σ|w|, and the parameter α governs the relative importance of the regularization term compared with the estimated error term. From all w_q derived from the networks NN_q, the evaluation module returns as the optimal weight w* the one for which the error function E(·) attains its minimum value.
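A minimal sketch of the evaluation module under Eq. (12) (Python/NumPy; the value of α and the function names are our own assumptions):

```python
import numpy as np

def regularized_error(sample_errors, w, alpha=0.01):
    # Eq. (12): summed sample errors plus the L1 penalty alpha * ||w||_1
    return float(np.sum(sample_errors) + alpha * np.sum(np.abs(w)))

def evaluate_networks(candidates, alpha=0.01):
    """candidates: list of (w_q, per-sample errors of NN_q), one entry per trained NN_q.
    Returns the optimal weight w* minimizing the regularized error E(.)."""
    scores = [regularized_error(errs, w, alpha) for w, errs in candidates]
    return candidates[int(np.argmin(scores))][0]
```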

2.2.3 Summary

Algorithm 1 Parallel Neural Network
Input: observations: a set of observations, each with input vector x and target value z;
       networks: a set of three-layer neural networks, each with a predefined number of neurons and activation function A(·);
Output: w*
(in parallel do)
for q := 1 to the number of elements in networks do
    initialize weights w_q in the network NN_q;
    repeat
        perform a feed-forward operation (see Eq. (6));
        compute the error between the target value and the estimate (see Eq. (7));
        propagate the error backward (see Eqs. (9)–(10));
        update the weights w_q in the network (see Eq. (11));
    until the stopping criterion is satisfied
end for
Evaluate the networks: w* ← argmin_q E(w_q)

We summarize the complete flowchart of the PNN in Fig. 4, with the pseudocode shown in Algorithm 1.

Fig. 4 Flowchart of the parallel neural network (PNN)
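Putting the pieces together, the sketch below (Python with NumPy and the standard-library multiprocessing module; our own simplified code, not the authors') trains a few candidate three-layer networks in parallel, one hidden-layer size per task, and returns the weights of the network with the smallest training error. A fixed number of epochs stands in for the paper's Δw_q = 0 stopping criterion, and the L1 penalty of Eq. (12) is omitted for brevity.

```python
from multiprocessing import Pool
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_one(args):
    """Train one candidate NN_q (hidden size M) by batch gradient descent."""
    M, X, Z, lr, epochs = args
    rng = np.random.default_rng(M)                 # distinct initial weights per NN_q
    Wh = rng.normal(scale=0.1, size=(M, X.shape[1]))
    wo = rng.normal(scale=0.1, size=M)
    for _ in range(epochs):
        A = sigmoid(X @ Wh.T)                      # hidden outputs for all samples
        Zt = sigmoid(A @ wo)                       # estimates z~, Eq. (6)
        d1 = -(Z - Zt) * Zt * (1 - Zt)             # -(z - z~) A'(beta_1)
        g_out = A.T @ d1                           # batch form of Eq. (9)
        g_hid = ((d1[:, None] * wo) * A * (1 - A)).T @ X   # batch form of Eq. (10)
        wo, Wh = wo - lr * g_out, Wh - lr * g_hid  # Eq. (11)
    E = float(np.sum(0.5 * (Z - sigmoid(sigmoid(X @ Wh.T) @ wo)) ** 2))  # Eq. (7)
    return (Wh, wo), E

def parallel_nn(X, Z, hidden_sizes=(4, 8, 16), lr=0.1, epochs=500, workers=3):
    with Pool(processes=workers) as pool:          # one process per candidate NN_q
        results = pool.map(train_one, [(M, X, Z, lr, epochs) for M in hidden_sizes])
    weights, errors = zip(*results)
    return weights[int(np.argmin(errors))]         # evaluation module: argmin_q E(w_q)
```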

3 Simulation study

Let y_1, y_2, . . . , y_t be a time series. At time t, for t ≥ 1, the next value y_{t+1} is predicted based on the observed realizations (for training or monitoring) of y_t, y_{t−1}, y_{t−2}, . . . , y_1. In order to confirm the properties we derived, we perform a Monte Carlo simulation with stylized data, that is, data whose patterns are predetermined.

3.1 The data

The simulated power demand series y_t is generated according to an additive model y_t = x_t + ε_t, ε_t ∼ N(0, σ²), where x_t is the deterministic term that captures the seasonality observed in practice and ε_t is the nondeterministic part, that is, a random term that stands for events that cannot be perfectly predicted and that follows a normal distribution N(0, σ²).

We simulate two data patterns in our study. For the Pattern 1 data, we use the trigonometric model suggested by Chen et al. (2010) for the seasonality of power demand as follows:

x_t = s_0 + \sum_{i=1}^{2} s_i \sin(f_i t + \theta_i),

where s_i and θ_i, i = 0, . . . , 2, are constants. For simulating the Pattern 2 data, we extend the preceding equation with an additional seasonality term, as follows:

x_t = s_0 + \sum_{i=1}^{2} s_i \sin(f_i t + \theta_i) + \sum_{j=1}^{2} s_j \cos(f_j t + \theta_j),

where s_i, θ_i, and θ_j, i, j = 0, . . . , 2, are constants.
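A minimal generator for the two stylized patterns (Python/NumPy). The paper does not report the constants s_i, f_i, θ_i or σ it used, so the values below are purely illustrative assumptions (half-hourly data with daily and weekly cycles).

```python
import numpy as np

def simulate_demand(N=5000, pattern=1, sigma=1.0, seed=0):
    """Generate a stylized demand series y_t = x_t + eps_t as in Sect. 3.1."""
    rng = np.random.default_rng(seed)
    t = np.arange(N)
    s = [10.0, 3.0, 1.5]                    # s_0, s_1, s_2 (assumed)
    f = [2 * np.pi / 48, 2 * np.pi / 336]   # daily and weekly frequencies (assumed)
    theta = [0.0, 0.5, 1.0]                 # theta_0, theta_1, theta_2 (assumed)
    x = s[0] + sum(s[i] * np.sin(f[i - 1] * t + theta[i]) for i in (1, 2))
    if pattern == 2:                        # Pattern 2 adds the cosine seasonality term
        x = x + sum(s[j] * np.cos(f[j - 1] * t + theta[j]) for j in (1, 2))
    eps = rng.normal(0.0, sigma, size=N)    # nondeterministic part, N(0, sigma^2)
    return x + eps
```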


Fig. 5 Comparison of in-sample modeling performances of the PNN with other alternative methods, measured by the mean and variance of RMSE for two different stylized data patterns

Letting N denote the length of the sample, we set N = 5000 and run 20,000 replications for each simulation. The subsample series used for the in-sample study are randomly selected by a moving window of length T, and replacement is allowed in the sampling. Letting T_F denote the length of the forecasting series, we perform one-day-ahead out-of-sample forecasting (1 ≤ T ≤ T + T_F ≤ N). In our analysis, a subsample length (i.e., window length) of T = 672 (approximately two weeks) was chosen for the in-sample simulation and T_F = 48 (approximately one day) for the out-of-sample forecasting. A total of 9060 sub-samples (1812 sub-samples for each region during five years) were randomly created.
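A sketch of the moving-window sub-sampling described above (Python/NumPy; the generator interface and seed are our own choices):

```python
import numpy as np

def moving_windows(y, T=672, TF=48, n_subsamples=1812, seed=0):
    """Randomly draw (train, test) sub-samples with replacement, as in Sect. 3.1.

    T  : in-sample window length (about two weeks of half-hourly data)
    TF : out-of-sample horizon (one day ahead)
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    starts = rng.integers(0, N - T - TF + 1, size=n_subsamples)  # replacement allowed
    for s in starts:
        train = y[s:s + T]            # used for in-sample monitoring/training
        test = y[s + T:s + T + TF]    # used for one-day-ahead forecasting
        yield train, test
```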

3.2 Alternative models

In the simulation study, we compare the performance of the PNN framework with three competing models: the linear seasonal dummy model (LSDM), the autoregressive moving average (ARMA) model, and the ARMA model with noise exhibiting the generalized autoregressive conditional heteroskedasticity (GARCH) effect. The analysis is performed for both the in-sample (monitoring or training) and out-of-sample (forecasting) investigations with the two data patterns described in the previous section. We compare the performance of model fitting by using the root mean squared error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) as the goodness-of-fit measures.

Fig. 6 Comparison of out-of-sample forecasting performances of the PNN with other alternative methods, measured by the mean and variance of RMSE for two different stylized data patterns
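For reference, the three goodness-of-fit measures can be computed as follows (Python/NumPy sketch, our own helpers; MAPE is returned as a fraction, so multiply by 100 for a percentage):

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((np.asarray(y, float) - np.asarray(yhat, float)) ** 2)))

def mape(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs((y - yhat) / y)))

def mae(y, yhat):
    return float(np.mean(np.abs(np.asarray(y, float) - np.asarray(yhat, float))))
```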

The LSDM with s (s ≥ 1) seasonality dummy variables is described as follows:

y_t = \alpha_0 + \sum_{i=1}^{s} \alpha_i D_{it} + \varepsilon_t,  (13)

where D_{it} stands for the seasonal dummy: D_{it} = 1 if observation t belongs to the i-th period, and D_{it} = 0 otherwise.

The ARMA(r,m) model is defined with the following formula:

y_t = \alpha_0 + \sum_{i=1}^{r} \alpha_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{m} \beta_j \varepsilon_{t-j}  (14)

Fig. 7 Comparison of in-sample modeling performances of the PNN with other alternative methods, measured by the mean and variance of MAPE for two different stylized data patterns

where ε ∼ N(0, 1). Equation (14) also serves as the conditional mean for the ARMA-GARCH model. Let ε_t = σ_t u_t, where the conditional variance of the innovations, σ_t², is by definition

\mathrm{Var}_{t-1}(y_t) = E_{t-1}\bigl(\varepsilon_t^2\bigr) = \sigma_t^2.

The general GARCH(p, q) process for the conditional variance of the innovations is then

\sigma_t^2 = \kappa + \sum_{i=1}^{p} \gamma_i \sigma_{t-i}^2 + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}^2  (15)

In our study, all models, that is, Eqs. (13)–(15), have been parameterized with ε_t ∼ N(0, 1); see Sun et al. (2007) for details about modeling and parameterizing residuals under different conditions.
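To illustrate the benchmark data-generating processes of Eqs. (14)–(15), here is a simple simulator (Python/NumPy); all coefficient values are illustrative assumptions, and setting the GARCH and ARCH coefficients to zero recovers a plain ARMA(2,1) with innovation variance κ.

```python
import numpy as np

def simulate_arma_garch(N=5000, alpha0=0.0, ar=(0.5, 0.1), ma=(0.3,),
                        kappa=0.1, garch=(0.6,), arch=(0.3,), seed=0):
    """Simulate the ARMA(2,1)-GARCH(1,1) benchmark of Eqs. (14)-(15),
    with u_t ~ N(0, 1) and eps_t = sigma_t * u_t."""
    rng = np.random.default_rng(seed)
    y = np.zeros(N); eps = np.zeros(N); sig2 = np.full(N, kappa)
    for t in range(max(len(ar), len(ma), len(garch), len(arch)), N):
        sig2[t] = (kappa
                   + sum(g * sig2[t - 1 - i] for i, g in enumerate(garch))
                   + sum(a * eps[t - 1 - j] ** 2 for j, a in enumerate(arch)))  # Eq. (15)
        eps[t] = np.sqrt(sig2[t]) * rng.standard_normal()
        y[t] = (alpha0
                + sum(p * y[t - 1 - i] for i, p in enumerate(ar))
                + eps[t]
                + sum(b * eps[t - 1 - j] for j, b in enumerate(ma)))            # Eq. (14)
    return y
```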

Fig. 8 Comparison of out-of-sample forecasting performances of the PNN with other alternative methods, measured by the mean and variance of MAPE for two different stylized data patterns

3.3 Results

We conduct a simulation study to investigate the performance of PNNs. The purpose of this simulation study is twofold. First, we show that for arbitrary data a PNN results in better performance than the alternatives, not only in the in-sample modeling but also in the out-of-sample forecasting. Second, we illustrate the properties of PNNs, in particular by showing the consistency of our algorithm, that is, that the error generated by a PNN is bounded and less than those of the alternative methods. For each pattern, we compute the RMSE, which measures the distance between the true signal and its approximation. In Figs. 5 and 6, we display the mean and variance of the RMSE with respect to the in-sample and out-of-sample performance. The smaller these values are, the better the performance of the underlying model. We can then identify that the mean and variance of the RMSE for PNNs are clearly smaller than those of the other models for the simulated Pattern 1 and Pattern 2 data. For the in-sample modeling, we observe error divergence for the alternative methods based on the variance of the RMSE in Panels 1 and 2 of Fig. 5: when the sample size increases, the errors of the alternative methods exhibit a growing tendency. In contrast, the error of the PNN remains stable, that is, the error tends not to increase when the sample size is enlarged. We observe a similar profile for the out-of-sample forecasting, see Panels 2 and 4 in Fig. 6.

Fig. 9 Comparison of in-sample modeling performances of the PNN with other alternative methods, measured by the mean and variance of MAE for two different stylized data patterns

We compare the performance of all methods as measured by the mean and variance of MAPE, for the in-sample study in Fig. 7 and the out-of-sample study in Fig. 8. Panels 1 and 3 in Fig. 7 exhibit similar profiles and show that the PNN has the smallest MAPE for Data Patterns 1 and 2 in the in-sample modeling. For the out-of-sample forecasting, Panels 1 and 3 in Fig. 8 show that the error magnitude of all methods increases but that of the PNN remains the smallest. For the variance of MAPE, the PNN also has the smallest value, see Panels 2 and 4 in Figs. 7 and 8. In particular, the variances of MAPE in both the in-sample and out-of-sample studies illustrate consistency in value fluctuation.

We also compute the MAE for each pattern. In Figs. 9 and 10, we display the mean and variance of the MAE for the in-sample and out-of-sample performance, respectively. Similar to the results based on the RMSE and MAPE, we find that the mean and variance of the MAE for PNNs are clearly smaller than those of the other models for the simulated Pattern 1 and Pattern 2 data; the MAE measurement thus shows a profile similar to that of the RMSE and MAPE.

We therefore conclude that the proposed PNN procedure performs better than the other models for the two data patterns investigated in our simulation study.

Fig. 10 Comparison of out-of-sample forecasting performances of the PNN with other alternative methods, measured by the mean and variance of MAE for two different stylized data patterns

4 Empirical study

We then apply our method to real data from the Australian electricity market, whose original patterns are preserved, for an empirical study. We use the same procedure as in our simulation study for the in-sample (monitoring/training) and out-of-sample (forecasting) investigations. In addition, we apply a robustness test based on two transformation-invariant metrics to verify our empirical results.

4.1 The data

In our empirical study, we use the high-frequency (i.e., half-hourly) data of electricity demand in Australia collected from January 1, 2009, to December 31, 2014, released by the National Electricity Market (NEM), which is the Australian wholesale electricity market. The NEM operates in five interconnected regions: New South Wales (NSW), Queensland (QLD), South Australia (SA), Tasmania (TAS), and Victoria (VIC). Sun and Meinl (2012) highlight some features of high-frequency data and suggest powerful denoising tools to smooth the data. Dealing with abnormal and unusual seasonality is the primary concern in demand forecasting; therefore, we maintain the originality of the data in our study, without removing any outliers or otherwise cleaning the data.

4.2 Sequence alignment for demand similarity

A similar-day method stemming from sequence alignment is based on the fractal assumption that the historical data have similar characteristics to the day being forecasted, see Sun et al. (2007) and references therein. Many characteristics can be extracted for demand patterns, such as the time of day, the nature of the day (weekday, weekend, or holiday), business day variation across months, and seasonal anomalies.

In our study, t refers to time, which stands for a day or a minute, and we use Δt to identify a specific time period. For example, given t as a day, Δt = 0 means the current day (or "today") under consideration; Δt ∈ [−14, 14] means the period of two weeks before and after today. We then define four indicators with capital letters, that is, H_Δt, D_Δt, W_Δt, and Y_Δt, in order to easily classify the days similar to our target day: H refers to the hour indicator, D is the date indicator, W indicates the day of the week in a given year, and Y shows the year. The subscript Δt shows the length of the period of the considered indicator. Therefore, any power demand can be expressed as x(H_Δt, D_Δt, Y_Δt).

We introduce four types of similar demand days with respect to our target day's demand x(H_0, D_0, Y_0); a code sketch following the list below illustrates how they can be collected:

• x_{i,1} is the first type of similar demand day, defined as x(H_0, D_0, Y_{−1}), which means the demand in the same hour on the same day one year ago.

• x_{i,2} is the second type of similar demand day, defined as x(H_0, W_0, Y_{−1}), which means the demand in the same hour on the same day of the week one year ago.

• x_{i,3} is the third type of similar demand day, defined as x(H_0, D_{[−14,+14]}, Y_{−1}), which means the demand in the same hour in a day chosen from a period of four weeks around the same day one year ago.

• x_{i,4} is the fourth type of similar demand day, defined as x(H_0, D_{[−28,0]}, Y_0), which means the demand in the same hour in a day chosen from the period of four weeks before the target day.
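The sketch below (Python with pandas; our own helper, not from the paper) collects the candidate values for the four types for one target timestamp. The exact calendar conventions, e.g., using 364 days for "the same day of the week one year ago", are assumptions made for illustration.

```python
import pandas as pd

def similar_day_features(demand, target):
    """demand: pandas Series of half-hourly load indexed by a DatetimeIndex;
    target: pandas Timestamp of the half-hour being forecast."""
    one_year = pd.DateOffset(years=1)
    x1 = demand.get(target - one_year)                        # x_{i,1}: same hour, same date, last year
    x2 = demand.get(target - pd.Timedelta(days=364))          # x_{i,2}: same hour, same weekday, last year
    around = [target - one_year + pd.Timedelta(days=d) for d in range(-14, 15)]
    x3 = demand.reindex(around).dropna()                      # x_{i,3}: +/- two weeks around the date last year
    before = [target - pd.Timedelta(days=d) for d in range(1, 29)]
    x4 = demand.reindex(before).dropna()                      # x_{i,4}: four weeks before the target day
    return x1, x2, x3, x4
```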

When considering x_{i,2}, we face the day-of-the-week anomaly caused by the irregular switching between working days and non-working days. Electricity demand on working days usually follows a regular pattern, but on non-working days it does not. Non-working days (i.e., weekends and public holidays) can be separated into two groups based on their position in the calendar week: fixed position or floating position. All weekends and the public holidays tied to a specific day of the week, for example, Easter Monday and the Queen's birthday, belong to the first group. Other public holidays, whose positions float to different days of the week, are in the second group.

The work week and the weekend are two complementary parts of the week devoted to work and rest, respectively. The legal work week is the part of the seven-day week devoted to labor; in most Western countries it is Monday through Friday, and the weekend comprises Saturday and Sunday.⁴ A weekday is any day of the work week. If a weekday coincides with a public holiday, we can observe different power demand patterns. We also observe particular patterns when public holidays overlap the weekend. We then classify non-weekend public holidays (i.e., public holidays that do not overlap the weekend) into two types: those adhering to weekends and those adhering to weekdays.

⁴ Some people extend the weekend to Friday nights as well. In some Christian traditions, Sunday is the "Lord's Day" and the day of rest and worship. Jewish Shabbat or the Biblical Sabbath lasts from sunset on Friday to the fall of full darkness on Saturday, leading to a Friday–Saturday weekend in Israel. Some Muslim-majority countries historically had a Thursday–Friday or Friday–Saturday weekend; however, recently many such countries have shifted from Thursday–Friday to Friday–Saturday or to Saturday–Sunday; see Wikipedia.

4.3 The result

We use the same setting and moving-window design as in the simulation. For the moving-window design, we set the in-sample size to one month and perform one-day-ahead out-of-sample forecasting. The moving window shifts 338 times for one year of data, and each time the window shifts we obtain a set of parameters for each method. With the data sampled over five years, the moving window shifts 1690 times in total. We provide descriptive statistics of the parameters for each method in Table 1.

Table 2 reports the in-sample results for the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSDM models. We see that each measure of in-sample error (i.e., MAPE, MAE, and RMSE) is reduced when applying a PNN compared with the other models. The PNN reduces the MAPE by 39%, 44%, and 44% compared with the results of the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSDM models, respectively. The PNN also reduces the MAE on average by 42.3%, 46.9%, and 49.2% when compared with the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSDM models, respectively, and reduces the RMSE by 46.7%, 52.6%, and 49.5% on average. We report the out-of-sample forecasting performances of the PNN, ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSDM models in Table 3. The PNN increases the average accuracy of the MAPE by 45.6%, 45%, and 45% when compared to the ARMA(2,1), ARMA(2,1)-GARCH(1,1), and LSDM models, respectively. Based on the MAE, the PNN also increases the average accuracy by 50.8%, 50.2%, and 39.1%, respectively, and using the RMSE as the accuracy measure, the PNN improves the accuracy on average by 48.4%, 47.9%, and 36.4%. We find a generally significant improvement provided by the PNN when compared with the other three alternative models.

4.3.1 Robustness testing

We perform a robustness test based on two metrics: the Kolmogorov–Smirnov and Kuiper metrics applied by Chen et al. (2019) for goodness-of-fit evaluation. Baucells and Borgonovo (2013) illustrate the advantages of using the Kolmogorov–Smirnov and Kuiper metrics: they have good invariant probabilistic features under monotonic transformations and are easy to interpret.

Table 1 Descriptive statistics (mean, variance, minimum, and maximum per region: NSW, QLD, SA, TAS, VIC) for the parameters of ARMA(2,1), ARMA(2,1)-GARCH(1,1), LSDM*, and PNN (1690 iterations). *×10³

Table 2 Comparison of the goodness-of-fit of the PNN with other alternative methods, that is, ARMA(2,1), ARMA(2,1)-GARCH(1,1), and the linear seasonal dummy model (LSDM), for the in-sample (monitoring/training) performance, by the average value of the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE), per region and year (2010–2014)

Kolmogorov–Smirnov’s (KS) metric and Kuiper’s metric, and are defined as fol-lows:

KS := \sup_{x \in \mathbb{R}} \left| F_n(S) - F(\tilde{S}) \right|, \quad \text{and} \quad K := \sup_{x \in \mathbb{R}} \left( F_n(S) - F(\tilde{S}) \right) + \sup_{x \in \mathbb{R}} \left( F(\tilde{S}) - F_n(S) \right),

where F_n(S) denotes the empirical sample distribution of S, and F(S̃) is the distribution function of the approximation (i.e., the output of the PNN). We conclude that the smaller these metrics are, the better the performance of the underlying model will be (Table 3).
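A minimal sketch of the two metrics (Python/NumPy), where the distribution of the approximation is also evaluated as an empirical distribution over the forecast values, an assumption made for illustration:

```python
import numpy as np

def ks_and_kuiper(sample, approx):
    """Kolmogorov-Smirnov and Kuiper distances between the empirical
    distributions of the observed series S and its approximation S~."""
    grid = np.sort(np.concatenate([sample, approx]))          # pooled jump points
    Fn = np.searchsorted(np.sort(sample), grid, side="right") / len(sample)
    F = np.searchsorted(np.sort(approx), grid, side="right") / len(approx)
    ks = np.max(np.abs(Fn - F))                               # sup |F_n(S) - F(S~)|
    kuiper = np.max(Fn - F) + np.max(F - Fn)                  # sup(F_n - F) + sup(F - F_n)
    return float(ks), float(kuiper)
```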

In Table 4, we find that the mean value of the Kolmogorov–Smirnov metric of the PNN is smaller in both the in-sample and out-of-sample tests than those of the other models, which confirms that the PNN performs better than the other three models. In particular, the mean value of the KS metric for the PNN in the in-sample study is 24.28% less than that of the ARMA model, 26.62% less than that of the ARMA-GARCH model, and 69.14% less than that of the LSDM, and the corresponding error reduction for the out-of-sample forecasting is 51.32%, 49.56%, and 31.20%, respectively, in comparison with the other models.

We observe similar results for Kuiper's metric, verifying the superior performance of the PNN compared with the other alternative models. In addition, for the out-of-sample forecasting, the mean value of the Kuiper metric for the PNN is 36.75% less than that of the ARMA model, 41.54% less than that of the ARMA-GARCH model, and 39.35% less than that of the LSDM.

4.3.2 Discussion

The proposed PNN essentially implements a mapping function from input to output in a parallel way, and mathematical theory has proven that it has the ability to implement any complex nonlinear mapping, which makes it particularly suitable for solving problems with complex internal mechanisms. In our empirical study, the electricity demand exhibits complex internal features. For example, the demand on weekdays and weekends follows different patterns, particularly when public holidays add heterogeneity to their variation. The proposed PNNs can automatically extract reasonable rules by learning from a set of examples with recognized variation; that is, they have the ability to learn by themselves to distinguish the demand heterogeneity. In addition, the PNN has a certain capability for generalization.

We apply two time series models in our study. The ARMA model essentially captures only linear relationships. Predicting time series data using the ARMA model requires stationarity, such that the data have no trend or seasonality; that is, the mean has a constant amplitude on the time axis, and the variance tends to the same stable value on the time axis. If the data are not stationary, it is impossible to capture their regularity. Although the GARCH model can deal with overlaying changes, its conditional variance is not only a linear function of the squared lagged residuals but also a linear function of the lagged conditional variance, so its capability to handle nonstationarity is weakened. On the other hand, the LSDM can predict seasonal time series. However, high-frequency


Table 3: Comparison of goodness-of-fit of the PNN with other alternative methods, that is, ARMA(2,1), ARMA(2,1)-GARCH(1,1), and the linear seasonal dummy model (LSDM), for the out-of-sample (forecasting) performance by the average value of the mean absolute percentage error (MAPE, ×10⁻²), mean absolute error (MAE, ×10²), and root mean squared error (RMSE, ×10²). Rows report the years 2010–2014 by region (NSW, QLD, SA, TAS, VIC), together with the overall mean and variance; columns report MAPE, MAE, and RMSE for each of the four models.


Table 4: Robust tests measured by Kolmogorov–Smirnov's metric (KS) and Kuiper's metric (×10²) of the PNN with other alternative methods, that is, ARMA(2,1), ARMA(2,1)-GARCH(1,1), and the linear seasonal dummy model (LSDM), for the in-sample (monitoring/training) and out-of-sample (forecasting) performance. Rows report the regions (NSW, QLD, SA, TAS, VIC) together with the mean and variance, separately for the in-sample and out-of-sample periods; columns report KS and Kuiper for each of the four models.


However, high-frequency time series data often show more complex seasonal patterns. For example, daily data may have both a weekly seasonal pattern and an annual seasonal pattern. In our study, the half-hourly data usually contain four types of seasonality: daily, weekly, monthly, and annual seasonal patterns. Most of the methods we have considered so far are not able to deal with these complex seasonal issues. The LSDM can only handle one type of seasonality, and it often assumes that frequencies take integer values.

To process such time series data, we propose the PNNs to handle complex seasonal time series. This method allows us to specify all possible seasonal patterns and also has the flexibility to deal with non-integer frequencies. Although the method is flexible, we do not need to include all frequencies that may be relevant; we can include only those frequencies that are likely to appear in the data. For example, with only 180 days of daily data, the annual seasonality can be ignored; if the data are measurements of natural phenomena (for example, temperature), the weekly seasonality may be ignored.
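One common way to encode several, possibly non-integer, seasonal frequencies as network inputs is through Fourier terms; the sketch below is our own illustration for half-hourly data (the period choices 48, 336, and 365.25 × 48 are assumptions, not values taken from the paper):

import numpy as np

def fourier_features(t, periods=(48.0, 336.0, 365.25 * 48.0), harmonics=2):
    # Sine/cosine features for each seasonal period, with t measured in
    # half-hour steps: 48 (daily), 336 (weekly), 365.25 * 48 (annual,
    # a non-integer frequency when expressed in days).
    t = np.asarray(t, dtype=float)
    cols = []
    for period in periods:
        for k in range(1, harmonics + 1):
            cols.append(np.sin(2.0 * np.pi * k * t / period))
            cols.append(np.cos(2.0 * np.pi * k * t / period))
    return np.column_stack(cols)

Only the cycles that are actually expected in the data need to be included in the periods argument.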

We shall point out that there exists sluggishness when the PNNs are used to learn features that exhibit self-similarity. Because the optimized objective function is complex, flat areas inevitably appear when the neuron output is close to 0 or 1. In these areas, the weights (i.e., ω) change little. Therefore, the traditional one-dimensional search method cannot be used to find the step size of each iteration; instead, the update rule of the step size must be given to the PNNs in advance.
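A pre-specified step-size rule can be as simple as a deterministic decay schedule; the following minimal sketch (our own example of such a rule, not the schedule used for the PNN) shrinks the learning rate with the iteration count instead of relying on a line search:

def step_size(iteration, eta0=0.1, decay=0.001):
    # Pre-specified update rule: the step size depends only on the
    # iteration counter, so no one-dimensional search is needed.
    return eta0 / (1.0 + decay * iteration)

# A weight update with the scheduled step size would then read:
#   w = w - step_size(k) * gradient(w)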

5 Conclusion

Power demand forecasting is crucial in demand-side energy management and efficient management of existing power systems. It is also needed for the optimization of planning additional power capacity based on the architecture of the IoE. In addition, a robust forecasting method is fundamental for accomplishing servitization of electrical power firms. Based on the application of big data technology, we propose a novel quantitative method of parallel neural networks (PNNs) for robust monitoring and forecasting of power demand. Through our approach, we generalize the implementation by using simulated data and real data from Australia. Based on our results, we conclude that a PNN is a robust algorithm that enriches the class of big data mining methods applied for forecasting power demand.

The PNN provides a more compatible interface for future modification; for example, we can improve the PNN by introducing a more sophisticated decision function. In addition, future applications could be carried out with the proposed algorithm, which is expected to provide a significant improvement in the accuracy of modeling and forecasting, so that one can use it in risk and quality management with big data.

In our study, we focus only on parallel feed-forward networks; some work could be done on the parallelization of recurrent networks, for which more sophisticated bi-directional information flows could be structured. In addition, we did not discuss the potential challenges of algorithms based on statistical machine learning, which can be used to characterize the seasonality of power consumption in different time frames. Still, more attention needs to be paid to the relative cost of over-fitting when a deeper network setting is considered.


Acknowledgements This work was financially supported by (1) the Center for Open Intelligent Connectivity from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, (2) the Ministry of Science and Technology in Taiwan under MOST 106-2221-E-009-049-MY2, MOST 107-2218-E-009-020, and MOST 107-2823-8-009-004; and (3) InfoTech Frankfurt, Germany under DE32 024 5658.

