
Neural Network-based Dynamical Modelling for System Prognostics and Predictions

Submitted in total fulfilment of the requirements of the degree of

Doctor of Philosophy

Mengqiu Tao

Faculty of Science, Engineering and Technology

Swinburne University of Technology

Melbourne, Australia

2019


Abstract

Modelling and prediction of dynamical systems have been challenging research

problems in various areas. The purposes of dynamical modelling are to capture

the system dynamics, extract useful information from noisy data, and predict

system states. This thesis focuses on developing efficient and robust neural mod-

elling algorithms and performing system prognostics and state predictions. To

enhance the dynamical modelling capability and learning efficiency, the neural

model structures and learning algorithms are studied in detail.

First, a non-linear autoregressive exogenous (NARX) network-based approach

for dynamical system modelling is explored. Unlike the conventional neural net-

work training methods, a much more efficient training algorithm is developed for

the NARX neural networks, based on the Monte-Carlo stochastic principle. To

enhance the capability and robustness of the dynamical modelling, the NARX

neural model with two different feedback loops is constructed and applied to

the prognostics of the NASA simulated turbofan engine system. The simula-

tion results show the superior performance of the developed neural model with

comparisons to several existing learning approaches.

Second, a novel neural learning scheme, called back-forward stochastic con-

volution (BFSC), is presented for dynamical modelling with sequential output

data. Inspired by the information extraction mechanism in human daily

life, a novel feature extraction algorithm is proposed with back-forward stochastic

convolution. Then, the developed learning methodology is applied to the extreme

learning machine (ELM) and echo state networks (ESNs). The validations of the

developed training scheme are conducted through comparisons between the BFSC


trained models and the original models with two applications, the Mackey-Glass

sequence predictions and NASA turbofan engine system prognostics.

Third, a neural learning-based sinusoidal structured modelling approach for

vibration signal modelling and gear tooth prognosis is investigated. According

to the signal characteristics, the model is constructed as a superposition of three

signal components induced by different system mechanisms. After that, the back-

propagation (BP) and the adaptive moment estimation (Adam) gradient descent

are adopted for the model training. A group of gear rig test datasets is used to

validate the developed modelling scheme.


Acknowledgement

It has been an enjoyable and important PhD journey for me to study at Swinburne

University of Technology. There are no words that can express my gratitude to the

people who helped me during the last four years. I would first like to express

my greatest gratitude to Prof. Zhihong Man, my principal supervisor, for his

inspiring research supervision. I would also like to thank the other supervisors, Dr.

Jinchuan Zheng, and Dr. Antonio (Tony) L. Cricenti. The thesis could not have
been completed without their help. I would also like to extend my gratitude to

these great researchers who have provided so many good suggestions for improving

my work, Dr. Wenyi Wang from DST Australia, and Prof. Hai L. Vu from Monash

University.

I am deeply thankful to the Swinburne University of Technology for awarding

me the AutoCRC scholarship and providing me with such a comfortable study

environment. Special thanks to Molly Liu from the student administration, Victo-
ria Jandayan from the FSET facilities team, Andrew Zammit from IT client services,
Hannah Nguyen and Casey Megna from the research support office, and Krzysztof
Stachowicz from the robotics laboratory for their constant research support. I

would also like to thank the outstanding academics at Swinburne, Prof. Qin-

glong Han, Prof. Cishen Zhang, Prof. Romesh Nagarajah, A/Prof. Weixiang

Shen, A/Prof. Zhenwei Cao, Dr. Jiong Jin, Dr. Xianming Zhang, Dr. Xiaohua

Ge and Dr. Boda Ning. I thank them for their invaluable help with my PhD research.

I want to thank my group mates, Dr. Wenjie Ye, Dr. Linxian Zhi, Dr. Lingkai

Xing, Dr. Hong Du, Weiduo Zhang, Pengcheng Wang, Yew Wee Wong, Zhengyi

Shen, Zhangwei Xu, Xiaoxue Zhang, Qianfeng Zhu and Jiapeng Yan, who made


my PhD life interesting and full of stories. I also want to thank my colleagues,

Chengda Lu, Derui Ding, Guowang Xu, Hongyu Sun, Ranjie Duan, Ada Hung

and others who helped me during the last several years.

I would like to say special thanks to the friends I met at CAPS Australia, the

lovely Ruosang Qiu, Yongqing Jiang, Ruohui Lin, Miaosi Li, Yang Wang, Sifei

Han, Longxiang Gao, Helly, Gouping Hu, Hang Gao, Shuang Yi, Chenglong Xu,

Baiqian Dai, Jing Shang, Tianxin Pan, Daniel Xu, Youyang Qu, Tieshan Yang,

Huashan Li, Zongjia Chen, Kate Wang, Danzi Song and many more. I learnt so

much from these friendly and vibrant people!

Last but not least, I want to express my deepest gratitude to my family,

especially my parents, and my landlords, Dr. Zhengming Xu and Dr. Ping Fu,

who have treated me as their son. Thanks so much for their invaluable support.


Declaration

This is to certify that:

1. This thesis contains no material which has been accepted for the award of
any other degree or diploma to the candidate.

2. To the best of my knowledge, this thesis contains no material previously

published or written by another person.

3. For the work that is based on the joint research and publications, the rela-

tive contributions of the respective creators and authors are disclosed.

Mengqiu Tao, 2019


Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Efficient Dynamical Neural Modelling Scheme . . . . . . . 3

1.1.2 Novel Learning Strategy for Neural Networks . . . . . . . 4

1.1.3 Investigations on Structured Learning . . . . . . . . . . . . 5

1.2 Challenges, Objectives and Contributions . . . . . . . . . . . . . . 5

1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Background 9

2.1 Dynamical System Modelling . . . . . . . . . . . . . . . . . . . . 10

2.1.1 Linear Dynamical Systems . . . . . . . . . . . . . . . . . . 11

2.1.2 Parallel and Series-parallel Models . . . . . . . . . . . . . 12

2.1.3 Nonlinear Dynamical System Modelling . . . . . . . . . . . 13

2.1.4 Summarized Dynamical Neural Model Structures . . . . . 15

2.2 Feed-forward Neural Network . . . . . . . . . . . . . . . . . . . . 17

2.2.1 Basic Elements of ANNs . . . . . . . . . . . . . . . . . . . 17

2.2.2 Single Hidden Layer Feed-forward Neural Networks . . . . 19

2.2.3 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 Recurrent Structures in Neural Networks . . . . . . . . . . . . . . 23

2.3.1 The Jordan Networks . . . . . . . . . . . . . . . . . . . . . 24

2.3.2 The Elman Networks . . . . . . . . . . . . . . . . . . . . . 25


2.3.3 The NARX Networks . . . . . . . . . . . . . . . . . . . . . 27

2.3.4 Dynamic Neural Network with Output Self-feedback . . . 28

2.4 Reservoir Computing . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.4.1 The ESNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.4.2 Echo State Property . . . . . . . . . . . . . . . . . . . . . 32

2.4.3 ESNs Variants . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.5 Multi-hidden Layered Neural Networks . . . . . . . . . . . . . . . 34

2.5.1 Learning Capability of Multi-layer Neural Networks . . . . 35

2.5.2 A Brief Overview of Deep Neural Networks . . . . . . . . 37

2.6 Neural Network Training . . . . . . . . . . . . . . . . . . . . . . . 39

2.6.1 Optimization Problem of Neural Network . . . . . . . . . . 40

2.6.2 Back-propagation . . . . . . . . . . . . . . . . . . . . . . . 41

2.6.3 Gradient Descent Variants . . . . . . . . . . . . . . . . . . 43

2.7 Randomized Training Scheme for Neural Networks Learning . . . 46

2.7.1 Monte Carlo-based Stochastic Approximation . . . . . . . 46

2.7.2 Extreme Learning Machine . . . . . . . . . . . . . . . . . . 48

2.8 Regularized Optimization Technique . . . . . . . . . . . . . . . . 50

2.8.1 Regularization Problem and Optimal Conditions . . . . . . 51

2.8.2 Tikhonov Regularization . . . . . . . . . . . . . . . . . . . 54

2.8.3 L1 Regularization and LASSO . . . . . . . . . . . . . . . . 56

2.9 Turbine Engine System Modelling and Prognostics . . . . . . . . . 59

2.9.1 Turbine Engine System Modelling . . . . . . . . . . . . . . 59

2.9.2 Turbine Engine System Prognostics . . . . . . . . . . . . . 62

2.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3 A NARX Network-based Approach for Dynamical System Mod-

elling with Application to Turbofan Engine Prognostics 65

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.2.1 Dynamical Neural Modelling with Sequence Output . . . . 69

3.2.2 Training Approaches for Dynamical Neural Models . . . . 71

3.3 The NARX-DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.3.1 Structure of The NARX-DNN with Multi-sequences Input

and Sequence Output . . . . . . . . . . . . . . . . . . . . . 72

3.3.2 A Simplified NARX-DNN for Training . . . . . . . . . . . 74

3.4 Randomization-based Training for The NARX-DNN . . . . . . . . 78

3.4.1 Non-linear Feedback Weights Estimation . . . . . . . . . . 78

3.4.2 Output Weights and Autoregressive Weights Optimization 81

3.5 Model Validations and Comparisons . . . . . . . . . . . . . . . . . 83

3.5.1 Turbofan System Health Index Evaluation . . . . . . . . . 84

3.5.2 System Prognostic and HI Prediction . . . . . . . . . . . . 87

3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4 Learning from Back-forward Stochastic Convolution for Dynam-

ical System Modelling with Sequence Output 97

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.2.1 Dynamical Neural Modelling with Sequence Output . . . . 100

4.2.2 Approximation with Stochastic Convolution Sampling . . . 103

4.2.3 The ESNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.3 The BFSC-based Learning Scheme . . . . . . . . . . . . . . . . . 106

4.3.1 BFSC-based Neural Modelling . . . . . . . . . . . . . . . . 106

4.3.2 Some Discussions on BFSC . . . . . . . . . . . . . . . . . 110

4.3.3 BFSC-ESNs . . . . . . . . . . . . . . . . . . . . . . . . . . 113

4.4 Performance Investigation . . . . . . . . . . . . . . . . . . . . . . 117

4.4.1 Mackey-Glass Sequence Prediction . . . . . . . . . . . . . 117


4.4.2 Turbofan Engine System Prognostic . . . . . . . . . . . . . 120

4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5 Neural Network Learning-based Sinusoidal Structured Modelling

for Gear Tooth Cracking Process 127

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.2.1 Gear Mesh Vibration Signal Modelling . . . . . . . . . . . 130

5.2.2 Sinusoidal Structured Signal Model . . . . . . . . . . . . . 133

5.3 The Neural Network Learning-based Structured Sinusoidal Modelling . . 137

5.4 Performance Investigation . . . . . . . . . . . . . . . . . . . . . . 142

5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6 Conclusions and Future Work 149

6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . 149

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.2.1 Investigations on Long Short-term Memory of the NARX

Type of Neural Network . . . . . . . . . . . . . . . . . . . 151

6.2.2 Kinematic Structured Neural Network for Dynamical Sys-

tem State Predictions . . . . . . . . . . . . . . . . . . . . . 153

6.2.3 A Neural Adaptive Feedback Scheme for Time-varying Non-

linear System Modelling . . . . . . . . . . . . . . . . . . . 154

Bibliography 156

List of Publications 182


List of Figures

2.1 Signals with different characteristics: (a) vibration detected from
industrial equipment; (b) gear mesh vibration signal. . . . . . . . . 11

2.2 Basic dynamical modelling and identification structures: (a) par-

allel model; (b) series-parallel model . . . . . . . . . . . . . . . . . 14

2.3 One artificial neuron . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 The basic architecture of SLFNs . . . . . . . . . . . . . . . . . . . 20

2.5 Magnitude and phase of the frequency response of randomly gen-

erated convolution kernels. . . . . . . . . . . . . . . . . . . . . . . 23

2.6 Jordan networks (not all connections are shown). . . . . . . . . . 24

2.7 Elman networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.8 NARX neural network. . . . . . . . . . . . . . . . . . . . . . . . . 27

2.9 DNN with output self-feedback. . . . . . . . . . . . . . . . . . . . 29

2.10 The basic network structure of ESNs. . . . . . . . . . . . . . . . . 30

2.11 Information propagation of the feed-forward neural networks with

two hidden layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.12 The structure of ELM . . . . . . . . . . . . . . . . . . . . . . . . 49

2.13 Graphical explanation for the inequality constraints: (a) optimal

point is within the constrained region; (b) optimal point is on the

constraint boundary. . . . . . . . . . . . . . . . . . . . . . . . . . 54


2.14 Different characteristics of the norm functions with one dimen-

sional input case: (a) L1 norm loss function L1(A) = |A|; (b) L2

norm loss function L2(A) = 1/2|A|2; (c) the derivative function of

L1 norm; (d) the L2 norm derivative function dL2(A)/dA = A . . 57

2.15 Structure of a simplified turbofan engine model . . . . . . . . . . 61

3.1 The network structure of NARX-DNN . . . . . . . . . . . . . . . 73

3.2 Structure of a simplified turbofan engine system . . . . . . . . . . 84

3.3 Right aligned run-to-failure sensor measurements of 100 engine

units from FD001: (a) Sensor 2; (b) Sensor 12. . . . . . . . . . . . 85

3.4 Visualization of the degradations of different engine units gener-

ated by ELM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.5 Engine early stage HI sequence predictions (repeated 20 trials) on

a testing engine unit by (a) the ELM; (b) the DNN with output

self-feedback; (c) the NARX-DNN. . . . . . . . . . . . . . . . . . 88

3.6 Engine intermediate stage HI sequence predictions (repeated 20

trials) on a testing engine unit by (a) the ELM; (b) the DNN with

output self-feedback; (c) the NARX-DNN. . . . . . . . . . . . . . 89

3.7 Engine failure stage HI sequence predictions (repeated 20 trials)

on a testing engine unit by (a) the ELM; (b) the DNN with output

self-feedback; (c) the NARX-DNN. . . . . . . . . . . . . . . . . . 90

3.8 Testing performance of 10-fold cross validation between ELM with

800 hidden nodes, NARX-BP (series-parallel model) with 50 hid-

den nodes (trained by Adam), and NARX-DNN with 300 hidden

nodes. The solid lines connect the averaged RMSE among 20 re-

peated simulations in the 10-fold cross validations. . . . . . . . . . 94


4.1 Examples of nonlinear dynamical system modelling with sequence

output. (a) sequence predictions of the Mackey-Glass system; (b)

data visualization of a turbofan engine dynamical modelling prob-

lem: multi-variate sequence input and sequence output. . . . . . . 101

4.2 An ESNs with multi-variate sequences input, sequence feedback

and sequence output. . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.3 Sketch of the BFSC training scheme: (a) back-forward stochastic

convolution; (b) BFSC-based four-layer neural network. . . . . . . 109

4.4 Visualization of Mackey-Glass sequence predictions: (a) plots of

repeated predictions by BFSC-ESNs with network structure 200-

400-200-100. (b) RMSE comparisons of the sequence prediction

(100 repeated trials) by four stochastic training based models on

the same training dataset and test sequence. . . . . . . . . . . . 118

4.5 The training time costs with different hidden node numbers for the
Mackey-Glass sequence predictions (100 trials). . . . . . . . . . . 120

4.6 Visualization of the NASA turbofan engine sensory measurements

and the corresponding evaluated system HI sequence. . . . . . . . 121

4.7 Examples of turbofan engine system HI sequence predictions at

different system stages by BFSC-ESNs with network structure 220-

400-100-10 (50 repeated trials): (a) initial system stage; (b) inter-

mediate system stage; (c) failure stage. . . . . . . . . . . . . . . . 123

4.8 Training and testing RMSE of four stochastically trained models for
the NASA turbofan engine prognostic problem, under different hid-
den node number settings. . . . . . . . . . . . . . . . . . . . . . . 124

4.9 The training time costs of four stochastically trained models with differ-
ent hidden node numbers for NASA turbofan engine system mod-
elling (100 repeated trials). . . . . . . . . . . . . . . . . . . . . . . 125


5.1 The synchronous averaged signals yt at (a) G6b.30; (c) G6b.100; (e)

G6b.110; and the frequency spectrum at (b) G6b.30; (d) G6b.100;

(f) G6b.110. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

5.2 The convergence of the developed stochastic learning process with

the harmonic signal model for the signal at different revolution

time periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

5.3 Signal approximation performance using harmonic model h(t), (a)

signal approximation at G6b.30; (b) first residual signal extracted

at G6b.30; (c) signal approximation at G6b.110; (d) first residual

signal extracted at G6b.110. . . . . . . . . . . . . . . . . . . . . . 145

5.4 The frequency spectrum of the residual signal δt at (a) G6b. 50;

(b) G6b. 100; (c) G6b. 110. (d) the gear evolution process from

G6b. 50 to G6b. 110, in terms of magnitude summation of the

frequency band with shaft order 20 ∼ 40. . . . . . . . . . . . . . . 146

5.5 (a) Energy distribution of the first residual signal δt extracted from

G6b.50 to G6b. 110; (b) the system load from G6b.50 to G6b. 110;

(c) energy distribution of the second residual signal ε(t) extracted

from G6b.50 to G6b. 110. . . . . . . . . . . . . . . . . . . . . . . 147

6.1 Sketch of Newtonian neural computing schemes: (a) the structure

of multilayer feed-forward neural networks; (b) a typical network

topological structure of the proposed DKNN; (c) an example with

multi-sequential inputs and sequence output. . . . . . . . . . . . . 153

6.2 Neural adaptive feedback scheme. . . . . . . . . . . . . . . . . . . 155


List of Tables

3.1 The pseudo-code of the proposed NARX-DNN . . . . . . . . . . . 83

3.2 Testing results of the ELM from 100 repeated prediction trials with

different hidden node numbers at different system stages . . . . . 92

3.3 Testing results of the DNN with output self-feedback from 100

repeated prediction trials with different hidden node numbers at

different system stages . . . . . . . . . . . . . . . . . . . . . . . . 92

3.4 Testing results of the proposed NARX-DNN from 100 repeated

prediction trials with different hidden node numbers at different

system stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

3.5 Comparisons on averaged testing RMSE and training time cost . . . 93

4.1 Averaged testing results of four different models on Mackey-Glass

sequence predictions (100 repeated trials) . . . . . . . . . . . . . . 119

4.2 Averaged testing results of four different models on NASA turbofan

engine data with 100 repeated trials (the backward convolution
number is set as 100 for the BFSC based models in this test) . . . 124


Chapter 1

Introduction

Systems in nature are beautiful and dynamic. Driven by forces and impacts from
the environment, some of which are observable and some are not, real-world systems

are always complex. Scientists and engineers have been working on the discovery,

modelling and analysis of dynamical systems and objects interactions in various

fields, e.g., molecular dynamics [1, 2], neuroscience [3], information technology [4],

sociology and demography [5, 6], robotics [7], language and cognitive processing

[8], etc. The developments of mathematical language, information technology and

computing power have significantly improved the level of human cognition and brought

huge value to human society. It is very interesting and challenging to describe

natural or social objects and systems in an artificial way.

On the other hand, machine learning and artificial neural networks (ANNs)

have been developed rapidly during the last decade. With an appropriate train-

ing scheme, the ANNs are able to derive useful information from massive train-

ing data and handle various modelling problems. Especially, the deep learning

technique has been developed and successfully employed in various applications

including image processing, system identification, and pattern classification. It

has been shown that the ANNs can be applied to successfully resolve very chal-

lenging tasks. In many cases, artificial intelligence can even perform tasks much


better than humans. Unlike conventional machine learning methods, which re-
quire prior knowledge and careful engineering by hand, the state-of-the-art ANN-

based learning schemes are able to perform feature extraction directly from the

raw data with little engineering by hand.

This research aims at developing new neural network based learning schemes

for dynamical system modelling problems. In particular, a turbofan engine system
prognostic problem is used as the major case study in this thesis. To design robust

and highly efficient modelling processes, the new models are further developed

based on the existing neural structures and training methods. The simulated tur-

bofan engine dataset from NASA, a dataset generated from the Mackey-Glass system,
and a set of gear mesh vibration data collected from the Defence Science and Technol-
ogy Group (DSTO) are used to test the performance of the developed models.

1.1 Motivation

Although the ANNs are successfully employed in image processing and computer

vision with excellent performance, many research problems still remain.

It seems that the success of deep learning is mostly dependent on the development

of computing facilities and big data. In contrast, the development of ANNs itself

is limited. ANNs modelling normally consists of two steps: constructing the net-

work model and finding the optimal solution for the neural network parameters.

The conventional neural network structure is constructed in the form of a fully

connected layer-by-layer network, which is known for its universal approximation

capability. However, the mathematical structure of the neuron and network is

developed mostly by experience and lacks theoretical support in terms of its

capability and efficiency. In addition, there are always hundreds of network pa-

rameters to be optimized by the training algorithm, e.g., gradient descent based


back-propagation (BP) which is the most commonly used method for network

training. The slow-convergence and local minima problems are big issues for

real-time applications.

On the other hand, compared to the great achievement of the ANNs on im-

age processing, much less neural network research can be found on industrial

data analysis and dynamical system modelling. It is noted that data and mea-

surements from different systems show totally distinct characteristics. Unlike an

image, which normally contains large information redundancy, the sensory mea-

surements of an industrial system are always smaller in size and contain much less

redundancy. Especially, for the dynamical system modelling, the system devel-

oping trend and the dynamical features of the system states are very important.

Thus, many dynamical system modelling problems are aimed at capturing the

dynamical features from system sequential measurements.

In light of that, the motivations of this research are summarized as three parts,

presented below.

1.1.1 Efficient Dynamical Neural Modelling Scheme

A good modelling scheme should be designed with high algorithmic efficiency

and low complexity. For real-time dynamical system modelling problems, the

solutions are normally required to be obtained accurately and on a timely basis.

However, the conventional neural network-based approach is always structured

with a large number of neurons and takes a long time to find the optimal solution,

which motivates us to develop the dynamical neural modelling technique with

the following two aspects,

• Improving the neural network training time-efficiency.

Unlike the gradient descent based BP, randomization-based training schemes
have been developed for neural network training, e.g. the random vector func-


tional link (RVFL) network [9], and extreme learning machine (ELM) [10].

With randomly generated input weights and optimized output weights, the

single-hidden layer feed-forward networks (SLFNs) with sufficiently large

hidden node number can be trained as a universal approximator. Such a
learning scheme is able to obtain the optimal solution in a very fast manner.

However, to achieve good training performance, the hidden node number

of the randomization based network is always designed to be very large,

which leads to high computational cost. It is noted that the randomized generation
of hidden states is a feature extraction process, which is much less
structurally efficient compared to a BP-trained network.

• Improving the structural efficiency of the dynamical neural model.

The modelling capability and storage capacity are mostly dependent on the

size of the neural network. However, the modelling capability of the neural

network could also be affected by the training scheme and model struc-

ture. For dynamical system modelling with sequential input and output,

the recurrent and feedback structures of the network are worth investigating.
As shown in [11, 12], the networks with sequential feedback

have shown significant advantages for dynamical system modelling.

1.1.2 Novel Learning Strategy for Neural Networks

Inspired by the biological neural networks, the ANNs are aimed at learning sys-

tem knowledge from training data. However, it is noted that the model struc-

ture and learning strategy of ANNs are totally different from the human brain.

Although the BP algorithm has shown effective performance on many neural

network-based applications and is still widely used nowadays, we can rarely find

the back-propagation learning process in biological neural networks. The learning

of human beings is a complex process, which is still full of mystery. However, it is


very important to draw some ideas from biological information transformation
and the human learning process.

In addition to the neural structures, we may introduce some basic strategies

to help the learning procedure of the neural network. In human education, a

good teacher could help the students learn faster and better. Similarly, we could

develop some novel training schemes to improve the learning of the neural network.

1.1.3 Investigations on Structured Learning

The neural structures of our brain have been developing since our birth [13]. The

connections between neurons or even the neurons themselves are changing when

we are learning. There is no doubt that the structure of the neural model should

keep developing rather than remain static, although the current version of ANNs has

demonstrated very strong nonlinear approximation capability in various applica-

tions. It is noted that the ANNs still struggle to approximate some complex
signals, e.g. a sinusoidal signal with a number of frequency components, which also
motivates us to look for other types of structured learning.

The modelling capability of the ANNs with a finite number of nodes is limited.

In many cases, the ANNs also show very low structural efficiency on approxima-

tion. Thus, it is worthwhile to investigate modelling structures based on the

signal properties of the system.

1.2 Challenges, Objectives and Contributions

The problems of neural networks-based dynamical system modelling have been

studied for decades [21]. However, few research breakthroughs in this field can

be found in recent years. The main challenges are summarized as follows,


• Due to the non-linearity of ANNs, it is challenging to overcome the slow

convergence problem of the conventional dynamic neural network train-

ing. The low learning efficiency has limited the applicability of the neural

networks-based approaches in many cases.

• The structure design of the neural network is mostly experience-based and

lacks theoretical support. Thus, it is difficult to design a well-constructed

neural network and train the ANNs to capture the key dynamic features

from the noisy sensory data.

• Compared to the pattern classification tasks, the dynamical neural mod-

elling aims at deriving system dynamical features and predicting system

future evolution trends, which are much more challenging.

This thesis aims at developing efficient dynamical modelling schemes using

novel neural network learning techniques. The objectives of the dynamical mod-

elling are to capture the dynamical features from the input sensory sequences and
to predict the system states and developing trends. Several different neural net-

works related or motivated learning techniques are proposed for different mod-

elling problems. The main contributions of the thesis are summarized as follows.

• A new non-linear autoregressive exogenous (NARX) network-based dynam-

ical neural model is developed with different sequence feedback structures,

which shows excellent performance for the turbofan engine system health

state prediction problem.

• Based on Monte-Carlo approximation, an effective training scheme is pro-

posed for the training of the NARX network-based model, which achieves a

much faster training process, compared to the traditional BP algorithm.

• A novel back-forward stochastic convolution (BFSC) based learning scheme

has been proposed for the dynamical system modelling problem with se-


quence output, which achieves excellent performance in NASA turbofan

engine prognostics and Mackey-Glass sequence predictions.

• The BFSC scheme has been applied for ELM and echo state networks

(ESNs); the simulation results show that, by adding the backward stochas-

tic convolution process, the modelling capability of the neural network has

been significantly enhanced.

• A back-propagation based sinusoidal structured learning scheme is devel-

oped for gear prognosis and tooth-cracking process modelling. The simula-

tions on a gear rig dataset show the effectiveness of the developed structured

learning approach.

The developed neural networks-based modelling approaches can be employed

for various dynamical system modelling cases, e.g., financial time series predic-

tion, fault prognostics for mechanistic systems, molecular dynamic modelling, and

sociological data analysis. The investigated neural network structure and training

scheme may bring some new ideas for the above-mentioned applications.

1.3 Thesis Organization

As above-mentioned, the thesis is focused on developing efficient neural network

learning schemes for the dynamical modelling problems. Several neural network

based models and neural network learning based methods are developed and

investigated. The organization of the rest of the thesis is presented as follows.

• In Chapter 2, some necessary research background of the dynamical system

modelling problems is first introduced with a review of dynamical neu-

ral modelling structures. Then, the basic introduction of the feed-forward

neural network is presented. After that, the recurrent structures in neural


networks are comprehensively reviewed. As a special framework of recur-

rent neural networks (RNNs), the reservoir computing (RC) technique is

discussed. The ESNs model is also introduced in detail as an example

of RC. Then, a brief overview and discussions of the multi-hidden layered

neural networks are provided. Furthermore, the neural network training

schemes and regularization techniques for the optimization of the neural

networks are also presented.

• In Chapter 3, a new NARX network-based approach is developed for the

nonlinear dynamical system modelling with application to turbofan engine

system prognostics. A Monte-Carlo stochastic approximation based train-

ing scheme is developed for the NARX type of neural network. The de-

veloped model and learning scheme are then verified by using the NASA

turbofan engine dataset.

• In Chapter 4, a back-forward stochastic convolution (BFSC) based training

approach is proposed for the dynamical system modelling problems with

sequence output. The developed BFSC training scheme is then applied for

the ELM and ESNs models. The verifications of the proposed BFSC based

models are carried out to solve two dynamical modelling problems: the

NASA turbofan engine prognostics and Mackey-Glass sequence predictions.

• In Chapter 5, a sinusoidal structured learning approach is developed for gear

tooth cracking process modelling. Motivated by the neural network learn-

ing, the Adam stochastic gradient descent based back-propagation method

is applied for the training of the structured signal model. Then, a set of gear

rig test data is used to validate the proposed structured learning scheme.

• In Chapter 6, the summary of the thesis is presented, and some ideas for

future work are provided.


Chapter 2

Background

In this chapter, the necessary background on the neural network learning and

dynamical system modelling is presented. The main problems and challenges of

the dynamical system modelling are firstly discussed with a brief introduction of

the neural modelling methods. Then the feed-forward neural network is intro-

duced based on the fundamental network elements and theory. To address the

dynamical modelling problems, the recurrent structures in the neural network,

including the ESNs, are comprehensively reviewed. Then, the learning capability

of the neural network is discussed, followed by a brief review of the deep neu-

ral networks. Based on that, the traditional neural network training process is

formulated with the gradient descent based back-propagation. In addition, an

interesting category of neural network training, randomization-based training, is

formulated with some theoretical discussions. To obtain a robust training solution

for the neural network, the regularization techniques are also introduced. Fur-

thermore, the turbine engine system modelling is introduced as the major study

case of this thesis. Some existing techniques for the turbine system prognostics

are reviewed and discussed.


2.1 Dynamical System Modelling

As mentioned above, systems in nature are beautiful and dynamic. Driven by
forces and impacts from the environment, some of which are observable and some are
not, real-world systems are complex and always changing over time [14]. Fur-

thermore, the signals detected from different systems could show distinct charac-

teristics, as shown in Fig. 2.1. Thus, we can hardly use fixed equations to formu-

late and interpret a concrete dynamical system. A number of approaches have

been developed for dynamical system modelling and identification [15, 16, 17].

The main research challenges of the dynamical system modelling are summarized

as follows.

i. The data noise and system uncertainties make the time-related dynamical

features hard to capture.

ii. In some real dynamical systems, the nonlinear system dynamics are nor-

mally unknown and time-varying, e.g. chaotic phenomena.

iii. In real-time system identification cases which require fast processing, the
time efficiency of the system modelling approach is critical.

In this section, the linear dynamical system (LDS) is first introduced. Then

two categories of dynamical modelling scheme are presented, which are parallel

and series-parallel models. After that, the nonlinear dynamical system modelling

problems are discussed, followed by a summary of the neural network based dy-

namical model structures.


Figure 2.1: Signals with different characteristics: (a) vibration detected from industrial equipment; (b) gear mesh vibration signal.

2.1.1 Linear Dynamical Systems

A typical linear dynamical system can be formulated as follows,

$$\dot{y}_t = \phi(y_t) = A y_t + b \qquad (2.1)$$

where $y_t$ is a vector which represents the system states at time $t$, $\dot{y}_t$ represents the
state change rate, which is determined by the current system states, parameter


matrix $A$, and vector $b$. It is noted that the linear dynamical mechanism can be
expressed as an affine function $\phi : \mathbb{R}^n \to \mathbb{R}^n$.

Furthermore, in the discrete case, the linear dynamical system with a single

input single output (SISO) can be formulated in the following difference equation

form.

$$y_{t+1} = \sum_{i=1}^{n} \alpha_i y_{t-i+1} + \sum_{j=1}^{m} \beta_j u_{t-j+1} \qquad (2.2)$$

where, yt+1 is the system output state at time t + 1, ut is the system external

input at time $t$, and the system parameters $[\alpha_1, \alpha_2, ..., \alpha_n, \beta_1, \beta_2, ..., \beta_m]^T$ represent
the dynamical dependencies of the system output on the previous $n$ output states
and the recorded external input sequence of length $m$.
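As a minimal illustration of the difference model (2.2), the following Python sketch simulates a SISO linear system; the coefficient values and function name are arbitrary examples assumed here for illustration, not values from the thesis.

```python
import numpy as np

def simulate_lds(alpha, beta, u, y_init):
    """Simulate the SISO difference model (2.2):
    y[t+1] = sum_i alpha[i]*y[t-i] + sum_j beta[j]*u[t-j]."""
    n, m = len(alpha), len(beta)
    start = max(n, m) - 1
    y = list(y_init)                 # y_init must supply the first max(n, m) outputs
    for t in range(start, len(u) - 1):
        past_y = [y[t - i] for i in range(n)]   # y_t, ..., y_{t-n+1}
        past_u = [u[t - j] for j in range(m)]   # u_t, ..., u_{t-m+1}
        y.append(float(np.dot(alpha, past_y) + np.dot(beta, past_u)))
    return np.array(y)

# Example with arbitrary second-order coefficients and a step input.
alpha = [1.5, -0.7]      # dependence on y_t and y_{t-1}
beta = [0.1, 0.05]       # dependence on u_t and u_{t-1}
y = simulate_lds(alpha, beta, u=np.ones(50), y_init=[0.0, 0.0])
```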

The LDS based models are successfully applied in many fields including

robotics [18], pattern classification [19], and time series analysis [20]. By lin-

earising and simplifying the nonlinear components, the LDS modelling based

approaches are efficient and tractable. These approaches are able to handle many

systems which are dominated by linear components. However, in many real-world

systems, the dependencies between the system inputs and outputs are always

nonlinear and complex. The linear models can result in large modelling
errors. In addition to the construction of the model, system identification and

solution methods have also attracted a lot of attention from researchers.

2.1.2 Parallel and Series-parallel Models

For the identification and control of the dynamical systems, we can have two

different categories of parametric models which are the parallel model and series-

parallel model [21]; the basic structures are shown in Fig. 2.2.

Taking the LDS in (2.2) as an example, we can have its parallel model as


follows,

$$\hat{y}_{t+1} = \sum_{i=1}^{n} \alpha_i \hat{y}_{t-i+1} + \sum_{j=1}^{m} \beta_j u_{t-j+1} \qquad (2.3)$$

and the corresponding series-parallel model as,

$$\hat{y}_{t+1} = \sum_{i=1}^{n} \alpha_i y_{t-i+1} + \sum_{j=1}^{m} \beta_j u_{t-j+1} \qquad (2.4)$$

where $\hat{y}_t$ and $y_t$ represent the model output and the real plant output at time $t$,
respectively, and $u_t$ is the system input at time $t$.

The parallel model is designed to learn the dynamical information from the
previous outputs of the model and the system input. In contrast, the series-parallel
model is aimed at capturing the dynamical features from the recorded output se-

quence and input sequence of the real plant. According to the research work

in the past several decades, the parallel type of models always suffer from the

convergence problem. The conditions for the convergence of the parallel model

parameters still remain unknown [22]. In contrast, the series-parallel models have

exhibited very good performance.
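To make the distinction concrete, the sketch below contrasts the two structures: the series-parallel predictor (2.4) uses the measured plant outputs, while the parallel predictor (2.3) feeds back its own past estimates. The function names, and the assumption that the model parameters are already known, are purely illustrative.

```python
import numpy as np

def predict_series_parallel(alpha, beta, y_meas, u):
    """One-step-ahead prediction using measured plant outputs, Eq. (2.4)."""
    n, m = len(alpha), len(beta)
    y_hat = np.full(len(u), np.nan)
    for t in range(max(n, m) - 1, len(u) - 1):
        y_hat[t + 1] = (sum(alpha[i] * y_meas[t - i] for i in range(n))
                        + sum(beta[j] * u[t - j] for j in range(m)))
    return y_hat

def predict_parallel(alpha, beta, y_init, u):
    """Free-running simulation feeding back the model's own outputs, Eq. (2.3)."""
    n, m = len(alpha), len(beta)
    y_hat = list(y_init)             # the first max(n, m) outputs are assumed known
    for t in range(max(n, m) - 1, len(u) - 1):
        y_hat.append(sum(alpha[i] * y_hat[t - i] for i in range(n))
                     + sum(beta[j] * u[t - j] for j in range(m)))
    return np.array(y_hat)
```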

2.1.3 Nonlinear Dynamical System Modelling

In contrast to the LDS, the nonlinear dynamical system (NDS) is more complex

to analyse and control [23]. The dynamical relationship function φ becomes a

nonlinear function which is normally unknown. Furthermore, the dynamical be-

haviour of the NDS is very hard to handle and predict, although the system is

fundamentally deterministic. The Mackey-Glass system is a typical NDS, which was
first proposed in the 1970s [24] to describe physiological disorders in dynam-
ical respiratory and hematopoietic diseases. It is known for complex dynamics


Figure 2.2: Basic dynamical modelling and identification structures: (a) parallel model; (b) series-parallel model

including chaos. The Mackey-Glass equation is shown as follows,

$$\frac{dy_t}{dt} = \frac{\beta\, y_{t-\tau}}{1 + y_{t-\tau}^{\,n}} - \xi y_t, \qquad \beta, \xi, n > 0 \qquad (2.5)$$


where $\beta$ is the constant production rate, $\xi$ is the constant decay rate, $y_{t-\tau}$ denotes
the value of $y$ at time $t - \tau$, and $n$ is the exponent.
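Mackey-Glass data can be generated by, for example, forward-Euler integration of the delay differential equation (2.5). The parameter setting below (β = 0.2, ξ = 0.1, n = 10, τ = 17) is the commonly quoted chaotic regime; it is an assumption made here for illustration, not a value stated in this thesis.

```python
import numpy as np

def mackey_glass(length=2000, beta=0.2, xi=0.1, n=10, tau=17, dt=1.0, y0=1.2):
    """Forward-Euler integration of dy/dt = beta*y(t-tau)/(1+y(t-tau)^n) - xi*y(t)."""
    delay = int(round(tau / dt))
    y = np.zeros(length + delay)
    y[:delay] = y0                      # constant history before t = 0
    for t in range(delay, length + delay - 1):
        y_tau = y[t - delay]
        y[t + 1] = y[t] + dt * (beta * y_tau / (1.0 + y_tau ** n) - xi * y[t])
    return y[delay:]                    # discard the artificial history

series = mackey_glass()
```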

In the discrete case, a representation of a nonlinear dynamical system can be for-

mulated by the following difference equations [21]

$$x_{t+1} = \Phi[x_t, u_t] \qquad (2.6)$$

$$y_{t+1} = \Psi[x_{t+1}] \qquad (2.7)$$

where, ut ∈ Rn, xt ∈ RN , and yt ∈ Rm represent input vector, system state

vector, and output vector at time t respectively. Φ and Ψ are nonlinear maps.

Then, we have

$$y_{t+1} = \Psi[\Phi[x_t, u_t]] = \cdots = \Psi\big[\Phi[\ldots\Phi[\Phi[x_1, u_1], u_2], \ldots, u_t]\big] \qquad (2.8)$$

It is noted that the nonlinear functions Ψ and Φ are usually unknown, and the

state vector is also hard to access, which makes nonlinear dynamical system
modelling challenging.

2.1.4 Summarized Dynamical Neural Model Structures

To handle the complex non-linearity, the neural network based modelling ap-

proaches have been widely applied for the dynamical system identification and

control [21, 25]. Different neural network models have been developed to approx-

imate the nonlinear system dynamics. According to the literature [11, 12, 26], we

can have the following generic forms of the dynamical neural models,


Model I:

$$y_{t+1} = \sum_{i=0}^{l-1} \beta_i y_{t-i} + N_1\big(u_t, \ldots, u_{t-n+1}\big) \qquad (2.9)$$

Model II:

$$y_{t+1} = N_2\big(y_t, \ldots, y_{t-l+1}\big) + \sum_{i=0}^{n-1} \alpha_i u_{t-i} \qquad (2.10)$$

Model III:

$$y_{t+1} = N_3\big(y_t, \ldots, y_{t-l+1}, u_t, \ldots, u_{t-n+1}\big) \qquad (2.11)$$

Model IV:

$$y_{t+1} = g\big(h_{1(t)}, \ldots, h_{N(t-N+1)}\big) \qquad (2.12)$$

$$h_{i(t)} = f_i\big(h_{i(t-1)};\, u_t, \ldots, u_{t-n+1}\big) \qquad (2.13)$$

where, N1 : Rn → R, N2 : Rl → R, and N3 : Rm+n → R are neural network

functions, f : Rm → R and g : Rn → R are differentiable non-linear functions, βi

and αi are model parameters. In Model I, the output is linearly related to the past

values and non-linearly determined by the system inputs, while in Model II, the

output feedback mechanism is approximated by a neural network, and the system

input effect is designed in a linear form. In contrast, both the output feedback and

system input are designed as the neural network input in Model III. In Model IV,

the output state is determined by the internal state sequence [h1(t), ..., hN(t−N+1)],

which is dynamically related to its previous state hi(t−1), i ∈ {1, 2, ..., N}, and external

input [ut, ..., ut−n+1].

It is noted that all the above-mentioned models are designed to capture differ-

ent dynamical features from the system input and feedback sequences. Different

neural network structures are constructed to express the non-linear system mech-

anism. Further discussions and comparisons of the above models will be carried


out in the following sections.

2.2 Feed-forward Neural Network

Machine learning and ANNs have attracted extensive attention from various fields

in the past ten years [27]. Based on the developments of the computing technique

and the rapid increase of data resources, the ANNs have been successfully em-

ployed in many areas including dynamical system modelling [21], image processing

[28], and system identification [29]. In this section, some basic elements of the

ANNs are first introduced. Then, the convolution technique is discussed for the

feature extraction of time-sequence data.

2.2.1 Basic Elements of ANNs

Figure 2.3: One artificial neuron

The idea of ANNs originates from the biological neural network [30]. With

multiple layers, where each layer consists of many artificial neurons, ANNs are
able to handle many complex modelling problems [31]. The inputs of one
neuron normally come from external inputs or the outputs of the previous layer.
As shown in Fig. 2.3, the inputs [u1, u2, ..., un] are fed into a neuron by parameters

[w1, w2, ..., wn] which are called weights. Then, the summation of the weighted

inputs and a bias term is processed by a nonlinear activation function f : R → R.


The output of one artificial neuron thus can be written as,

$$y_k = f\Big(b_k + \sum_{i=1}^{n} u_i w_i\Big) \qquad (2.14)$$

It is noted that the nonlinear function is normally selected as a continuous
and differentiable function, e.g. the sigmoid function. However, for some specific
tasks, the function f may also be selected as a discontinuous function, e.g. the
rectifier function, f(x) = max(0, x). Several widely used nonlinear functions are listed as

follows,

Switching function:

$$f(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases} \qquad (2.15)$$

Linear function:

$$f(x) = x \qquad (2.16)$$

Saturating linear function:

$$f(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1 \end{cases} \qquad (2.17)$$

Sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2.18)$$

Tangent Sigmoid function:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (2.19)$$

Positive linear function:

$$f(x) = \begin{cases} 0 & x < 0 \\ x & x \ge 0 \end{cases} \qquad (2.20)$$
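For reference, the activation functions (2.15)-(2.20) can be written compactly in NumPy; this is only an illustrative sketch.

```python
import numpy as np

def switching(x):            # (2.15)
    return np.where(x >= 0, 1.0, 0.0)

def linear(x):               # (2.16)
    return x

def saturating_linear(x):    # (2.17)
    return np.clip(x, 0.0, 1.0)

def sigmoid(x):              # (2.18)
    return 1.0 / (1.0 + np.exp(-x))

def tangent_sigmoid(x):      # (2.19), equivalent to np.tanh(x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def positive_linear(x):      # (2.20), the rectifier f(x) = max(0, x)
    return np.maximum(0.0, x)
```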

From the perspective of cybernetics, the activation function is also known as

the transfer function which is used to express the input and output relationship

[32]. As shown in Fig. 2.3, we would like to consider the effects of the nonlinear

function together with the bias. The summation of the inputs is shifted by the bias

and then processed by the nonlinear function. Thus, the information propagation

can be controlled by the bias and nonlinear function. Further discussion will be

carried out in Chapter 4.

2.2.2 Single Hidden Layer Feed-forward Neural Networks

Based on the above-discussed neuron, we can construct a neural network with

multiple layers, where each layer consists of many neurons. In the feed-forward

neural network, the information is delivered from the neurons at one layer to the

neurons in the next layer, operated by different weights [33]. In this section, the

basic neural network model, the single hidden layer feed-forward neural networks

(SLFNs) is introduced.

The SLFN consists of three layers: an input layer, a hidden layer and an
output layer. Assume that we have an input vector x = [x1, x2, ..., xn], and the
SLFN is designed with N hidden nodes and n output nodes as shown in Fig.

2.4, we can formulate the network as,

$$h_j = f\Big(\sum_{i=1}^{n} \alpha_{ij} x_i + b^{(1)}_j\Big) \qquad (2.21)$$


Figure 2.4: The basic architecture of SLFNs

$$y_k = f\Big(\sum_{j=1}^{N} \beta_{jk} h_j + b^{(2)}_k\Big) \qquad (2.22)$$

where $\alpha_{ij}$ is the $ij$th element of the input weight matrix $\alpha \in \mathbb{R}^{n \times N}$, $b^{(1)}_j$ and $h_j$ are
the bias and output of the $j$th hidden node, $\beta_{jk}$ is the $jk$th element of the output
weight matrix $\beta \in \mathbb{R}^{N \times n}$, $b^{(2)}_k$ is the bias of the $k$th output node, and $y_k$ is the $k$th
output.
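A minimal forward pass of the SLFN in (2.21)-(2.22) may be sketched as follows, with randomly initialised weights used purely for illustration; the activation choice and the example dimensions are assumptions.

```python
import numpy as np

def slfn_forward(x, alpha, b1, beta, b2, f=np.tanh):
    """Forward pass of a single hidden layer feed-forward network.
    x: (n,) input; alpha: (n, N) input weights; beta: (N, n_out) output weights."""
    h = f(x @ alpha + b1)          # hidden outputs, Eq. (2.21)
    y = f(h @ beta + b2)           # network outputs, Eq. (2.22)
    return y

n, N, n_out = 4, 20, 2
rng = np.random.default_rng(0)
alpha = rng.standard_normal((n, N))
b1 = rng.standard_normal(N)
beta = rng.standard_normal((N, n_out))
b2 = rng.standard_normal(n_out)
y = slfn_forward(rng.standard_normal(n), alpha, b1, beta, b2)
```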

The information in the input vectors is first mapped to the hidden space by

the operation of weighted summation, followed by a nonlinear function. The input

features are then represented in the hidden feature space. From the perspective

of feature extraction, the hidden states are aimed at capturing the key feature

of the input and eliminating noise and irrelevant information. After that, the

output layer is designed to perform the classification and regression. In addition

to the structure of the neural network, the technique of searching for the optimal

solution of the parameters also plays a very important role in neural network

learning [10, 34].


2.2.3 Convolution

The convolution approach has been widely employed in neural network learning.

As an important category of ANNs, the convolutional neural networks (CNNs)
exhibit excellent performance in image classification [35], natural language pro-
cessing [36], and sequential data analysis [37]. The highlight of the CNNs is the
convolutional layer, where the discrete convolution is carried out in a restricted
region of the input signal, which is known as the receptive field in biological neu-
rons. In this section, the fundamentals of convolution for sequence data are
presented and discussed.

Convolution is a basic mathematical operation defined as,

$$f(t) * g(t) \triangleq \int_{-\infty}^{+\infty} f(\tau)\, g(t-\tau)\, d\tau \qquad (2.23)$$

or

$$f(t) * g(t) \triangleq \int_{-\infty}^{+\infty} f(t-\tau)\, g(\tau)\, d\tau \qquad (2.24)$$

where, f and g are functions defined on the same domain. t and τ are variables.

The result function f ∗ g can be considered as a correlation result of the two functions.

The discrete convolution of the samplings of two functions f ∗ g can be for-

mulated as

$$(f * g)[n] = \sum_{i=-\infty}^{+\infty} f[i]\, g[n-i] \qquad (2.25)$$

or

$$(f * g)[n] = \sum_{i=-\infty}^{+\infty} f[n-i]\, g[i] \qquad (2.26)$$


For the finite sequences convolution, we can have,

$$(f * g)[n] = \sum_{i=0}^{N-1} f[i]\, g[n-i] \qquad (2.27)$$

where, [ f [0], f [1], ..., f [N − 1] ] is the convolution kernel, [ g[n], g[n− 1], ..., g[n−

N+1] ] is the input sequence, and (f ∗g)[n] is the convolution result, which can be

considered as summarized information of the input sequence. By appropriately

selecting the convolution kernels, the convolution process exhibits different effects

to the input signal. Because of that, the convolution technique is widely used for

signal filtering, correlation analysis, and pattern recognition [38].

To further investigate the convolution process in (2.27), the frequency response
of the convolution kernel is discussed below. Assume that we have a general

exponential input signal as,

$$g[n] = A e^{j(\omega n + \varphi)} \qquad (2.28)$$

and the convolution kernel is selected as [b0, b1, ..., bN−1], then we can have the

output convolution result formulated as,

$$(f * g)[n] = \sum_{k=0}^{N-1} b_k A e^{j(\omega (n-k) + \varphi)} = A e^{j(\omega n + \varphi)} \cdot H(e^{j\omega}) \qquad (2.29)$$

where, the frequency response of the convolution kernel can be formulated as,

$$H(e^{j\omega}) = \sum_{k=0}^{N-1} b_k e^{-j\omega k} \qquad (2.30)$$

It is noted that the computation in (2.29) is also known as a causal discrete

finite impulse response (FIR) filter. The amplitude and phase of the frequency

response of five randomly generated convolution kernels with N = 10 are shown

in Fig. 2.5. By selecting different filtering coefficients bk, the convolution kernels


Figure 2.5: Magnitude and phase of the frequency response of randomly generated convolution kernels.

show varying modification effects on the input signal at different radian frequen-
cies. Thus, for a complex time sequence which consists of multiple frequency

components and noise, convolution kernels can be designed to perform the noise

filtering and feature extraction. However, it is very challenging to obtain appro-

priate convolution kernels for sequences, when the harmonic frequencies of the

signal are unknown.
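The filtering interpretation of (2.27)-(2.30) can be checked numerically. The sketch below convolves a sequence with a randomly drawn kernel of length N = 10 (as in Fig. 2.5) and evaluates the kernel's frequency response; the function names and sampling grid are illustrative assumptions.

```python
import numpy as np

def fir_convolve(kernel, g):
    """Causal finite convolution (2.27): (f*g)[n] = sum_i f[i] * g[n-i]."""
    N = len(kernel)
    out = np.zeros(len(g))
    for n in range(len(g)):
        out[n] = sum(kernel[i] * g[n - i] for i in range(N) if n - i >= 0)
    return out

def frequency_response(kernel, num_points=512):
    """H(e^{jw}) = sum_k b_k e^{-jwk}, evaluated on [0, pi], Eq. (2.30)."""
    omega = np.linspace(0.0, np.pi, num_points)
    k = np.arange(len(kernel))
    H = np.array([np.sum(kernel * np.exp(-1j * w * k)) for w in omega])
    return omega, np.abs(H), np.angle(H)

kernel = np.random.randn(10)            # a randomly generated kernel, N = 10
omega, magnitude, phase = frequency_response(kernel)
```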

2.3 Recurrent Structures in Neural Networks

Back in the 1980s, several types of RNNs were developed, e.g., Hopfield neural

networks [39], Jordan networks [40], Elman networks [41], and NARX networks

[12]. According to the literature, some existing recurrent propagating structures

are discussed in this section.


Figure 2.6: Jordan networks (not all connections are shown).

2.3.1 The Jordan Networks

The Jordan network is shown in Fig. 2.6. To capture the time-related system
behaviours, context units have been added to the network [40]; see the blue

nodes in Fig. 2.6. The output of the context unit can be written as:

$$s_{j(t)} = \mu s_{j(t-1)} + y_{j(t-1)} \qquad (2.31)$$

which can also be formulated as

$$s_{j(t)} = \sum_{\tau=1}^{t-1} \mu^{\tau-1} y_{j(t-\tau)} \qquad (2.32)$$

where, sj(t) is the output of jth context node at time t, which is determined


by its previous state sj(t−1), and its corresponding output feedback yj(t−1), µ is

the self-feedback weight of the context unit, yj(t−τ) is the output of jth output

node at time t − τ . The propagation mechanism between the output feedback

states and context unit states is formulated as (2.32). Then, the output of the

ith hidden node can be expressed as:

$$h_{i(t)} = h\Big(\sum_{k=1}^{n} x_{k(t)} w_{ki} + \sum_{j=1}^{m} s_{j(t)} w_{(n+j)i}\Big) \qquad (2.33)$$

where, wki is the input weight which links the kth input node and ith hidden

node, w(n+j)i is the feedback weight which connects the jth context node and ith

hidden node, n and m are the numbers of input nodes and output nodes,
respectively, and h : R → R is a nonlinear activation function, e.g. the sigmoid
function h(x) = 1/(1 + e^{-x}). Such recurrent connections are able to give the
network a richer memory of the system's dynamic properties through the
error-correcting weight training process.
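A single forward step of the Jordan network, combining the context update (2.31) with the hidden mapping (2.33), can be sketched as follows; the weights are random placeholders and this is not the training procedure discussed later.

```python
import numpy as np

def jordan_step(x_t, s_prev, y_prev, W_in, W_fb, mu, h=np.tanh):
    """One forward step of a Jordan network (hidden layer only).
    s_prev, y_prev: context and output vectors from time t-1."""
    s_t = mu * s_prev + y_prev                   # context update, Eq. (2.31)
    hidden = h(x_t @ W_in + s_t @ W_fb)          # hidden outputs, Eq. (2.33)
    return s_t, hidden

n_in, n_out, n_hidden = 3, 2, 8
rng = np.random.default_rng(1)
W_in = rng.standard_normal((n_in, n_hidden))     # input weights w_ki
W_fb = rng.standard_normal((n_out, n_hidden))    # feedback weights w_(n+j)i
s, y = np.zeros(n_out), np.zeros(n_out)
s, hidden = jordan_step(rng.standard_normal(n_in), s, y, W_in, W_fb, mu=0.5)
```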

2.3.2 The Elman Networks

The structure of the Elman networks is shown in Fig. 2.7. Unlike the Jordan

networks, the context units in the Elman networks connect exclusively with the

hidden nodes of the network [41, 42]. The context node states and hidden node

outputs are updated as:

$$s_{j(t)} = h_{j(t-1)} \qquad (2.34)$$

$$h_{i(t)} = h\Big(\sum_{k=1}^{n} x_{k(t)} w_{ki} + \sum_{j=1}^{m} h_{j(t-1)} w_{(n+j)i}\Big) \qquad (2.35)$$

As shown in (2.35), the hidden unit has a relationship with both the external

inputs and previous internal states. Without the connections from the output


Figure 2.7: Elman networks.

layer, the input of the context node is directly copied from the previous output

of the corresponding hidden node on a one-for-one basis, with the fixed weight of

one.

The Jordan networks and the Elman networks are designed with different

context propagation mechanisms, as shown in (2.31) and (2.34). Similarly, the

context outputs are then mapped to the feature space by feedback weights, ex-

pressed in (2.33) and (2.35). By optimizing the feedback weights under the pre-

defined recurrent connections, the networks are induced to learn the time-related

behaviours of the system. However, training problems with the feedback weights appear during the optimization procedure; the vanishing gradient has become the main issue for training RNNs. Besides, the self-feedback of the hidden states is pre-fixed, which limits the models' performance on dynamical system modelling. From the perspective of time series theory, the recursive steps of the context states in (2.31) and (2.34) could be further investigated.

Figure 2.8: NARX neural network.

2.3.3 The NARX Networks

In addition to the Jordan and Elman networks, another type of RNN focuses on discovering the dynamic features from the inputs and a parallel output feedback sequence. One such model is the NARX neural network [12, 43]. A single-output NARX neural network is illustrated in Fig. 2.8. The ith hidden state can be presented as,


h_i(t) = h\left( \sum_{k=1}^{n} x_k(t) w_{ki} + \sum_{j=1}^{m} y_{t-j} w_{(n+j)i} \right)   (2.36)

where, yt−j is the network output at time t− j. Instead of the recursive feedback

used in the Elman networks, the NARX network is directly designed with a

sequence of output feedback processed in parallel.
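A minimal sketch of the one-step-ahead operation implied by (2.36) is given below, assuming a single output and a tapped delay line holding the m most recent outputs; the dimensions, weights, and the linear readout are illustrative placeholders.

    import numpy as np
    from collections import deque

    def narx_step(x_t, y_delays, w_in, w_fb, beta):
        # hidden states, eq. (2.36): current input plus delayed outputs y_{t-1..t-m}
        h = np.tanh(x_t @ w_in + np.asarray(y_delays) @ w_fb)
        return float(h @ beta)              # single-output linear readout

    rng = np.random.default_rng(1)
    n, m, N = 4, 3, 20                      # input size, delay number, hidden nodes
    w_in, w_fb = rng.normal(size=(n, N)), rng.normal(size=(m, N))
    beta = rng.normal(size=N)
    y_buf = deque([0.0] * m, maxlen=m)      # tapped delay line: [y_{t-1}, ..., y_{t-m}]
    for t in range(10):
        y_t = narx_step(rng.normal(size=n), list(y_buf), w_in, w_fb, beta)
        y_buf.appendleft(y_t)               # newest prediction becomes y_{t-1}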

It is noted that NARX neural network-based modelling methods have achieved very good performance in time series prediction [44, 45]. Different from conventional RNNs, the NARX network is designed with direct connections from the past output states, which gives the model much better training performance compared to RNNs that suffer from the vanishing gradient problem. However, the limitations of the NARX network can be summarized as follows [46]:

i. The NARX structure only partially mitigates the vanishing gradient problem.

ii. The training process of the NARX network is often slow and inefficient.

iii. The number of delays in the NARX network is pre-fixed, which makes it suitable mainly for short-term prediction. How to deal with long-term dependencies by using the NARX network still remains to be investigated.

2.3.4 Dynamic Neural Network with Output Self-feedback

In addition to the output feedback structure of the NARX network, another neural network designed with output self-feedback was proposed in [11, 47] for crack growth propagation modelling of aluminium alloy. Similar to the NARX networks, the model is designed with memory of the output feedback states. Different from the NARX networks, the new DNN is designed with an output self-feedback tapped delay line, as shown in Fig. 2.9.

Figure 2.9: DNN with output self-feedback.

The model can be expressed as:

y_t = \sum_{i=1}^{N} h\left( \sum_{k=1}^{n} x_k(t) w_{ki} + b_i \right) \beta_i + \sum_{j=1}^{n_1} y_{t-j} \beta_{N+j}   (2.37)

where, βN+j is the jth output self-feedback weight, n1 is the feedback nodes

number. The DNN is optimized by a one-step training procedure based on the

Monte Carlo approximation and batch learning type of least square. The local

minima and slow convergence problems are avoided.
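To make the one-step training idea concrete, a small sketch is given below: random hidden features are generated from the inputs, the output self-feedback terms of (2.37) are appended as additional regressors, and all β coefficients are obtained in a single regularized least-squares step. The synthetic data, the sizes, and the regularization constant are placeholders and do not reproduce the models of [11, 47].

    import numpy as np

    rng = np.random.default_rng(2)
    T_len, n, N, n1, lam = 200, 3, 30, 2, 1e-3      # sequence length, sizes, ridge factor

    X = rng.normal(size=(T_len, n))                 # input sequence x(t)
    y = np.sin(0.1 * np.arange(T_len))              # target output sequence y_t

    w = rng.uniform(-1, 1, (n, N))                  # random input weights (fixed)
    b = rng.uniform(-1, 1, N)
    H = np.tanh(X @ w + b)                          # random hidden features, cf. (2.37)

    # output self-feedback regressors y_{t-1}, ..., y_{t-n1}
    fb = np.column_stack([np.roll(y, j) for j in range(1, n1 + 1)])
    Phi, target = np.hstack([H, fb])[n1:], y[n1:]   # drop warm-up rows

    # one-step regularized batch least squares for all beta coefficients
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ target)
    print("training RMSE:", np.sqrt(np.mean((Phi @ beta - target) ** 2)))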

Compared to the networks with recursive types of feedback like the Jordan

networks and the Elman networks, the parallel output feedback sequence directly

provides the system evolution information, which contains rich dynamic features

and the evolving trend of output states. Such sequence feedback is more suitable

for time series analysis problems. In general, different types of feedback loops are trained to record different system information and generate distinct effects on the output, which are worthwhile to investigate further. In this thesis, we mainly focus on the parallel output feedback structures, which deal with the summarized

system information from the past inputs.

2.4 Reservoir Computing

Another interesting framework of recurrent network-based computing technique

is called reservoir computing (RC) [48, 49]. The key part of the RC is the large

reservoir which is generated from the random projections of the network inputs

and internal feedback. Only the readout part of the network is trained.

2.4.1 The ESNs

Figure 2.10: The basic network structure of ESNs.

As one of the most popular models of RC, the echo state networks (ESNs) were developed in 2001 [50] and have exhibited very good performance in various applications, including wireless communication, time series prediction and dynamical system modelling [51, 52, 53, 54]. It is reported that the ESNs can achieve much better performance compared to conventional RNNs [55].

As shown in Fig. 2.10, the basic discrete ESN model can be formulated as follows,

x_t = g(w x_{t-1} + w_{in} u_t)   (2.38)

y_t = f\left( w_{out} (u_t, x_t) \right)   (2.39)

where, xt and xt−1 are the internal feature vectors at time t and t−1 respectively,

ut is the system input at time t, g = (g1, ..., gN) and f = (f1, ..., fm) are nonlinear

activation functions. The internal weight matrix w and the input weight matrix w_{in}, which connects the input layer to the reservoir, are both randomly generated and kept fixed. Only the output weight matrix w_{out} is optimized.

It is noted that the size of the reservoir is normally designed to be very large, so that it contains a high-dimensional feature representation of the inputs and the previous internal network state. In many dynamical system modelling cases, the reservoir state in

(2.38) can also be constructed as,

xt = g(wxt−1 +winut +wfbyt−1) (2.40)

where, yt−1 is the model output at t−1, wfb is the feedback weights matrix which

connects the model output to the reservoir. By such design, more system dynam-

ical information could flow into the reservoir, which may enhance the capability

of the model for the dynamical system modelling.

In addition, the output of the ESNs in (2.39) can also be designed as,

y_t = f\left( w_{out} (u_t, x_t, y_{t-1}) \right)   (2.41)

where, (ut,xt,yt−1) is the concatenation of the current system input, internal

state, and output feedback. By adding the output feedback loops in the output

part of the network, the robustness of the modelling can be significantly improved.
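A minimal sketch of the updates (2.40) and (2.41) is given below, taking f as a simple linear readout for clarity; the reservoir size, the scaling of the random matrices, and the readout weights are illustrative placeholders.

    import numpy as np

    def esn_step(x_prev, u_t, y_prev, w, w_in, w_fb, w_out):
        # reservoir state update with output feedback, eq. (2.40)
        x_t = np.tanh(w @ x_prev + w_in @ u_t + w_fb @ y_prev)
        # readout on the concatenation (u_t, x_t, y_{t-1}), eq. (2.41)
        z_t = np.concatenate([u_t, x_t, y_prev])
        return x_t, w_out @ z_t

    rng = np.random.default_rng(3)
    N, n_in, n_out = 100, 2, 1
    w = rng.normal(scale=0.05, size=(N, N))          # internal weights (fixed, random)
    w_in = rng.uniform(-0.5, 0.5, (N, n_in))         # input weights (fixed, random)
    w_fb = rng.uniform(-0.5, 0.5, (N, n_out))        # output feedback weights
    w_out = rng.normal(scale=0.1, size=(n_out, n_in + N + n_out))   # trainable readout

    x, y = np.zeros(N), np.zeros(n_out)
    for t in range(50):                              # drive with random inputs
        x, y = esn_step(x, rng.normal(size=n_in), y, w, w_in, w_fb, w_out)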


2.4.2 Echo State Property

It is important that the parameters of the ESNs are appropriately selected to

ensure that the network operates under stability conditions. Originally, the echo

state property (ESP) is defined in [50] as the stability property of the ESNs.

The ESP of the network is mainly determined by the algebraic properties of

the internal weight matrix and system input. Generally, to make the network

run under the ESP condition, the internal weights matrix is randomly sparsely

distributed, with spectral radius ρ(w) < 1.

Remark 2.4.1. For sequence input, the internal states of the ESNs can be con-

sidered as the stochastic convolution sampling results of the system input, internal

feedback, and output feedback. The sampling filters are randomly selected under

the ESP. The internal weight matrix can be generated as follows,

w = w_s \cdot \left( \delta / \max|\lambda_{w_s}| \right)   (2.42)

where, w_s ∈ R^{N×N} is a sparse, uniformly distributed random matrix, \lambda_{w_s} denotes the eigenvalues of w_s, and δ is a positive scalar which is normally selected to be smaller

than one, to constrain the spectral radius of the internal weight matrix to be less

than unity. However, it is noted that ρ(w) < 1 is considered as a rather restrictive

condition for the stability of the network [56]. Further discussions about the ESP

can be found in [57, 58].
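A small sketch of the generation step (2.42) is given below: a sparse, uniformly distributed matrix is drawn and rescaled so that its spectral radius equals a chosen δ < 1. The sparsity level and δ are illustrative values.

    import numpy as np

    def make_reservoir(N, sparsity=0.1, delta=0.9, seed=0):
        # sparse, uniformly distributed random matrix w_s
        rng = np.random.default_rng(seed)
        w_s = rng.uniform(-1.0, 1.0, (N, N))
        w_s *= rng.random((N, N)) < sparsity          # keep roughly `sparsity` of the entries
        # rescale so that the spectral radius equals delta, cf. (2.42)
        rho = np.max(np.abs(np.linalg.eigvals(w_s)))
        return w_s * (delta / rho)

    w = make_reservoir(200)
    print(np.max(np.abs(np.linalg.eigvals(w))))       # approximately 0.9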

2.4.3 ESNs Variants

Many different structures have been designed for the ESNs model. In this section,

some major categories of the ESNs are presented.

ESNs with leaky integrator


For the continuous dynamical systems, the ESNs are designed with leaky inte-

grator units [59]. Then, the propagation of the internal state can be written

as,

\dot{x} = C\left( -a x + f(w x + w_{in} u + w_{fb} y) \right)   (2.43)

where, C > 0 is a time constant, and a > 0 is the leaking rate. The changing rate of the state can be numerically approximated as \dot{x} ≈ (x_t − x_{t-1})/δ, where δ > 0 is the discretization step, and τ = Cδ. Then we have,

x_t = (1 − aτ) x_{t-1} + τ f(w x_{t-1} + w_{in} u_t + w_{fb} y_{t-1})   (2.44)

where, τ ∈ (0, 1), aτ < 1, and 1 − aτ ∈ (0, 1). The current internal state xt is

dynamically determined by its previous state, system input, and output feedback.

As presented in (2.43), the model itself can be considered as a nonlinear dynamical

system.
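The discrete leaky-integrator update (2.44) can be sketched in the same style as the ESN update above; a and τ below are illustrative values satisfying τ ∈ (0, 1) and aτ < 1, and the weight matrices are assumed to be generated as described in the previous subsection.

    import numpy as np

    def leaky_esn_step(x_prev, u_t, y_prev, w, w_in, w_fb, a=0.8, tau=0.5):
        # leaky-integrator reservoir update, eq. (2.44)
        pre = np.tanh(w @ x_prev + w_in @ u_t + w_fb @ y_prev)
        return (1.0 - a * tau) * x_prev + tau * pre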

ESNs with simplified internal connections

As presented in [60], different from the original ESNs, the model is designed with fixed and non-random internal topology. Three different simplified reservoirs

are constructed for the model. The topology of the reservoir can be simply

constructed by specially designing the internal feedback weights matrix w as

follows,

w_{s1} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 & 0 \\ r & 0 & 0 & \cdots & 0 & 0 \\ 0 & r & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & r & 0 \end{bmatrix}   (2.45)

w_{s2} = \begin{bmatrix} 0 & b & 0 & \cdots & 0 & 0 \\ r & 0 & b & \cdots & 0 & 0 \\ 0 & r & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & b \\ 0 & 0 & 0 & \cdots & r & 0 \end{bmatrix}   (2.46)

and

w_{s3} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 & r \\ r & 0 & 0 & \cdots & 0 & 0 \\ 0 & r & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & r & 0 \end{bmatrix}   (2.47)

where, r is the reservoir feed-forward connection weight, and b is the reservoir

feedback connection weight. The matrices w_{s1}, w_{s2}, and w_{s3} are the internal weight matrices of the delay-line reservoir, the delay-line reservoir with feedback connections, and the simple cycle reservoir, respectively. It is shown that the ESNs with such simplified reservoir topologies are able to achieve very good performance on many time series prediction problems [60].
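A small sketch for constructing the three simplified topologies (2.45)-(2.47) is given below; r and b are scalar connection weights as in [60], and the reservoir size used in the example is arbitrary.

    import numpy as np

    def delay_line(N, r):
        # w_s1 in (2.45): feed-forward chain on the sub-diagonal
        return np.diag(np.full(N - 1, r), k=-1)

    def delay_line_feedback(N, r, b):
        # w_s2 in (2.46): delay line plus feedback connections on the super-diagonal
        return np.diag(np.full(N - 1, r), k=-1) + np.diag(np.full(N - 1, b), k=1)

    def simple_cycle(N, r):
        # w_s3 in (2.47): delay line closed into a single cycle
        w = np.diag(np.full(N - 1, r), k=-1)
        w[0, -1] = r
        return w

    print(simple_cycle(4, 0.5))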

2.5 Multi-hidden Layered Neural Networks

In addition to the shallow neural networks discussed in the previous sections, to

solve complex machine learning or data mining problems, neural networks have

been constructed with multiple hidden layers. In this section, some discussions on

the learning capability of the multi-layer neural networks are presented, followed

by a brief overview of deep neural networks.


2.5.1 Learning Capability of Multi-layer Neural Networks

The SLFNs discussed in Section 2.2.2 can be applied to solve many machine

learning problems with good performance. However, with the developments of information technology, the problems we face are becoming much more complex and challenging. In particular, problems associated with big data and complex system mechanisms are beyond the capability of traditional shallow neural networks.

The learning capability of the three-layer neural network is discussed in [10, 61]. It is noted that any approximation problem with N distinct samples can be solved by a three-layered feed-forward neural network. A brief statement of the theory is presented as follows.

Theorem 2.5.1. Given an infinitely differentiable activation function f : R → R and N arbitrary distinct training samples (U, Y), where U = [u_1, u_2, ..., u_N]^T, Y = [y_1, y_2, ..., y_N]^T, u_i ∈ R^n, and y_i ∈ R^m, the output vector of the jth hidden neuron over the N samples, h_j = [f(w_j u_1 + b_j), ..., f(w_j u_N + b_j)]^T, can be generated by randomly selecting the input weights w = [w_1, ..., w_N] and the bias vector b = [b_1, ..., b_N] within a certain interval and distribution, where w_j ∈ R^n and b_j ∈ R. The outputs of the N hidden nodes then form H = [h_1, h_2, ..., h_N]. The feature matrix H is invertible, and ∃ β ∈ R^{m×N} such that, for ∀ ε ∈ R^+, ‖βH − Y‖ < ε.

The proof of the above theorem can be found in [61, 62], which is also briefly

presented as follows.

Proof. Since the training samples are distinct and the input weights are randomly selected, we can assume that w_j u_i ≠ w_{j'} u_i, for ∀ i ∈ {1, 2, ..., N} and ∀ j ≠ j'. We can write the jth column of the feature matrix H as a function of the bias b_j,

h(b_j) = [f(w_j u_1 + b_j), ..., f(w_j u_N + b_j)]^T   (2.48)


where, bj ∈ (b0, b0 + δ), and (b0, b0 + δ) is the randomization interval of the bias,

which belongs to R.

We can assume that the matrix H is not full-rank, e.g. rank(H) = N − 1, which means that there exists a vector c ∈ R^N orthogonal to the subspace of dimension N − 1. This assumption can be written as,

\left( c, h(b_j) − h(b_0) \right) = \sum_{i=1}^{N} c_i \cdot f(w_j u_i + b_j) − c \cdot h(b_0) = 0   (2.49)

Assuming c_N ≠ 0, we have,

f(w_j u_N + b_j) = −\frac{1}{c_N} \sum_{i=1}^{N-1} c_i \cdot f(w_j u_i + b_j) + c \cdot h(b_0)/c_N   (2.50)

Let α_i = −c_i/c_N, z = c \cdot h(b_0)/c_N, d_i = w_j u_i, and σ_i = d_i − d_N, where d_i ≠ d_{i'} for ∀ i ≠ i'. For b ∈ (b_0 + d_N, b_0 + δ + d_N), we can reformulate (2.50) as

f(b) = \sum_{i=1}^{N-1} α_i \cdot f(b + σ_i) + z   (2.51)

Since the nonlinear function f is infinitely differentiable, we can easily obtain the

derivatives of the function with respect to b as follows,

f'(b) = \sum_{i=1}^{N-1} α_i \cdot f'(b + σ_i)   (2.52)

f^{(2)}(b) = \sum_{i=1}^{N-1} α_i \cdot f^{(2)}(b + σ_i)   (2.53)

...

f^{(l)}(b) = \sum_{i=1}^{N-1} α_i \cdot f^{(l)}(b + σ_i)   (2.54)

where, f (l)(b) is the lth derivative of the function with respect to b. It is noted

that there are only N real coefficients α_1, α_2, ..., α_{N−1}, z. Thus, when l > N, the system of equations (2.51) to (2.54) cannot all be satisfied, which is a contradiction

to the assumption (2.49). Thus we can have N ≥ rank(H) > N − 1, which

means the feature matrix is invertible. For ∀ε ∈ R+, we can have ‖βH−Y ‖< ε,

where β = Y H−1, H−1 is the inverse of the feature matrix H .

Remark 2.5.1. As addressed in Theorem 2.5.1, any approximation problem with N distinct samples can theoretically be handled by a three-layered neural network with N hidden nodes. Thus, the capability and storage capacity of the SLFNs mostly rely on the size of the hidden layer. In practical applications, the number of training samples is usually very large. Due to the computational cost, the neural network is usually designed with a much smaller number of hidden nodes to perform the modelling. In many complex tasks with large datasets, the three-layered networks can hardly solve the problem.

Further study was carried out on the modelling capability of the two-hidden-layer feedforward networks (TLFNs) [61]. It shows that approximation problems with N distinct samples and m outputs can be solved by a TLFN with \sqrt{(m+2)N} + 2\sqrt{N/(m+2)} nodes in the first hidden layer and m\sqrt{N/(m+2)} neurons in the second hidden layer [62]. The theoretical analysis can be found in [62]. Thus, the modelling capability of TLFNs with an overall \sqrt{(m+2)N} + 2\sqrt{N/(m+2)} + m\sqrt{N/(m+2)} = 2\sqrt{(m+2)N} hidden nodes is equivalent to that of the SLFNs with N hidden nodes. In most practical applications, when N ≫ m ≥ 1 and N ≫ 2\sqrt{(m+2)N}, the TLFNs can provide much stronger modelling capability, compared to the SLFNs with the same number of hidden nodes.

2.5.2 A Brief Overview of Deep Neural Networks

The discussions in the above section have been confirmed in many practical applications. With the development of computing hardware, neural networks have been designed to be much larger and deeper. Since 2006, deep learning (DL) based neural networks have been developed and improved rapidly with the help of unsupervised learning strategies

[63]. Several categories of deep neural networks have been proposed with different

training schemes, e.g., auto-encoders [64], deep belief networks [65], and deep

convolutional neural networks (CNNs) [28, 66]. The deep neural networks can

discover much higher level features with proper training [67, 68].

In 2006, Hinton et al. first demonstrated a successful training approach for a many-layered feed-forward neural network, with pre-training in an unsupervised manner followed by fine-tuning with BP [63]. More details of the training process are presented in [65]. The proposed multi-layered feed-forward neural network is named deep belief networks (DBNs), where each layer is designed as a restricted

Boltzmann machine. The developed model has been demonstrated with very

good performance in pattern classification and image processing [69, 70].

Another popular type of deep neural network is the CNN. As discussed in the above section, the convolution technique has been applied in neural network learning in the form of a convolution layer [37]. Different from fully connected neural networks, e.g. the multilayer perceptron, the convolutional filters are designed to operate only within a restricted region, and a parameter sharing scheme is usually applied in the network training. In addition to the convolutional layer, the CNN is normally constructed with a pooling layer, a ReLU layer, a fully connected layer, and a loss layer [71]. It is noted that some of those layers are mainly designed for problems whose inputs contain large information redundancy. The network is constructed to eliminate unnecessary information and capture the key features for the final decision making.

Although deep learning and deep neural networks have achieved very good performance in many challenging applications, the slow training, difficult hyper-parameter selection, and high computational cost are still major issues for practical applications [72]. Furthermore, it is also important to go back to the basic structure of the neural network: not only the neurons but also the connectivities between them are worthwhile to investigate further [73].

Figure 2.11: Information propagation of the feed-forward neural networks withtwo hidden layers.

Remark 2.5.2. A feed-forward neural network with two hidden layers is shown in Fig. 2.11. It is noted that the connections between two network layers are drawn in a biological way. In addition to the fully-connected layers in traditional shallow neural networks, we would like to consider more architectures for the networks. In biological neural networks, the connectivity of neurons is totally different from that of ANNs, which could motivate us to develop more efficient information propagation schemes for neural computing.

2.6 Neural Network Training

In addition to the structure of neural networks, the optimization techniques also

play a very important role in neural network based learning [74]. In this section,

the general optimization problems of the ANNs are firstly presented. Then, the

most popular training algorithm, back-propagation, is formulated and discussed.

Furthermore, the variants of the gradient descent are also reviewed.


2.6.1 Optimization Problem of Neural Network

A general neural network model can be formulated as,

y(u) = N (u;α, b(1);β, b(2)) (2.55)

which can also be formulated as a composition of two sub-networks.

x(u) = N1(u;α, b(1))

= f(αu+ b(1)) (2.56)

y(x) = N2(x;β, b(2))

= f(βx+ b(2)) (2.57)

where, f : R → R is a non-linear activation function, u ∈ Rn×1 is the input

vector, x ∈ RN×1 is the hidden state vector, N is the hidden nodes number,

y ∈ Rm×1 is the output vector, α ∈ RN×n is the input weights matrix, β ∈ Rm×N

is the output weights matrix, b^{(1)} ∈ R^{N×1} and b^{(2)} ∈ R^{m×1} are the first and second bias vectors, respectively.

The training process of a neural network may vary. In this section, we con-

sider the general problem of traditional supervised approaches. Given the train-

ing samples (U ,T ), where, U = [u1,u2, ...,uN ], ui = [u1i, u2i, ..., uni]T , and

T = [t_1, t_2, ..., t_N], t_i = [t_{1i}, t_{2i}, ..., t_{mi}]^T, we can have the following optimization

problem.

\min \frac{1}{N} \sum_{i=1}^{N} V\left( \mathcal{N}(u_i; α, b^{(1)}; β, b^{(2)}), t_i \right)   (2.58)

where, V(·) is the objective function which is used to show the distance or cor-

relation between the output of the neural network N (ui;α, b(1);β, b(2)) and the

target t_i. To perform a least square estimation, the objective function can be defined as a norm function [75]. The training problem can be designed as,

\min \frac{1}{2N} \left\| \mathcal{N}(U; α, b^{(1)}; β, b^{(2)}) − T \right\|^2   (2.59)

The neural network aims to learn the optimal solution of the parameters {α, b^{(1)}; β, b^{(2)}} so that the model generates outputs matching the training targets. However, how to design the objective function remains an open research question. It is noted that the objective function is not limited to a norm function or a likelihood function. Practically, the objective function should be designed to comprehensively represent the task requirements, while the computational complexity should also be considered. In the following sections, we focus on the norm-based objective function to introduce some basic approaches to neural network training.

2.6.2 Back-propagation

Developed in the 1980s, BP is the most popular training algorithm for the ANNs

[76, 77], which is used to backwards propagate the derivative and obtain the

searching direction of the parameters [78].

Since a general neural network is normally associated with a large number of

parameters and nonlinear operations, it is very challenging to obtain an optimal

solution. Especially, for the multi-layered neural networks, it is hard to tune

the weights in the initial layers. Because of that, the BP chain rule has been

developed to train the network.

Considering the neural network model in (2.56) and (2.57) with the training samples (U, T), the BP training process can be formulated as follows.

1. Randomly initialize the weights and bias vectors {α, b(1);β, b(2)}.

2. Generate the model output by feeding the training inputs to the neural network, which can be calculated as,

X = N1(U ;α, b(1)) = f(αU + b(1)) (2.60)

Y = N2(X;β, b(2)) = f(βX + b(2)) (2.61)

3. Calculate the cost function as

L(α, b^{(1)}, β, b^{(2)}) = \frac{1}{2N} \| Y − T \|^2   (2.62)

4. Calculate the derivatives with respect to each parameter as,

\frac{\partial L}{\partial β} = \frac{\partial L}{\partial Y} \cdot \frac{\partial \mathcal{N}_2}{\partial β}   (2.63)

\frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial Y} \cdot \frac{\partial \mathcal{N}_2}{\partial b^{(2)}}   (2.64)

\frac{\partial L}{\partial α} = \frac{\partial L}{\partial Y} \cdot \frac{\partial \mathcal{N}_2}{\partial X} \cdot \frac{\partial \mathcal{N}_1}{\partial α}   (2.65)

\frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial Y} \cdot \frac{\partial \mathcal{N}_2}{\partial X} \cdot \frac{\partial \mathcal{N}_1}{\partial b^{(1)}}   (2.66)

5. Perform the gradient descent with step size γ to tune the parameters as,

β = β − γ \frac{\partial L}{\partial β}   (2.67)

b^{(2)} = b^{(2)} − γ \frac{\partial L}{\partial b^{(2)}}   (2.68)

α = α − γ \frac{\partial L}{\partial α}   (2.69)

b^{(1)} = b^{(1)} − γ \frac{\partial L}{\partial b^{(1)}}   (2.70)

6. Repeat steps 2 ∼ 5 until the error converges to zero or to a sufficiently small value.
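A minimal numpy sketch of the above steps for the single-hidden-layer model (2.56)-(2.57) is given below, using sigmoid activations and the squared-error cost (2.62). The toy data, layer sizes, learning rate, and epoch count are placeholders chosen only to illustrate the procedure.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(4)
    n, N, m, samples, gamma = 2, 8, 1, 100, 0.5

    U = rng.normal(size=(n, samples))                        # training inputs (columns)
    T = 0.5 + 0.5 * np.sin(U.sum(axis=0, keepdims=True))     # toy targets in (0, 1)

    alpha, b1 = rng.normal(scale=0.1, size=(N, n)), np.zeros((N, 1))   # step 1
    beta,  b2 = rng.normal(scale=0.1, size=(m, N)), np.zeros((m, 1))

    for epoch in range(2000):
        X = sigmoid(alpha @ U + b1)                          # step 2, eq. (2.60)
        Y = sigmoid(beta @ X + b2)                           # step 2, eq. (2.61)
        E = Y - T                                            # step 3: L = ||E||^2 / (2*samples)
        dY = E * Y * (1.0 - Y) / samples                     # error at the output pre-activation
        dBeta, dB2 = dY @ X.T, dY.sum(axis=1, keepdims=True)           # step 4, (2.63)-(2.64)
        dX = (beta.T @ dY) * X * (1.0 - X)                   # back-propagate to the hidden layer
        dAlpha, dB1 = dX @ U.T, dX.sum(axis=1, keepdims=True)          # step 4, (2.65)-(2.66)
        beta, b2 = beta - gamma * dBeta, b2 - gamma * dB2              # step 5, (2.67)-(2.68)
        alpha, b1 = alpha - gamma * dAlpha, b1 - gamma * dB1           # step 5, (2.69)-(2.70)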

Remark 2.6.1. The BP-based approaches have achieved very effective training for both shallow and deep neural networks. However, many training problems have been identified in applications of the BP-based methods, which are summarized as follows,

i. Since the back-propagated derivative only represents the local searching di-

rection, the BP based training algorithms always suffer from the slow con-

vergence problem.

ii. The BP could be stuck at the local minima, and generate a poor solution.

iii. The learning step parameters are hard to tune, which could significantly affect the training result.

iv. The vanishing gradient problem may occur, when the BP based scheme is

used to train the RNNs.

2.6.3 Gradient Descent Variants

In order to improve the time efficiency of the BP algorithm, several gradient de-

scent approaches have been developed to accelerate the convergence process [79].

To simplify the formulation in this section, the batched gradient descent step in

(2.67) is taken as an example.

Stochastic gradient descent

Similar to (2.67), the stochastic gradient descent can be formulated as [80],

β_t = β_{t-1} − γ \frac{\partial}{\partial β} L_i(β_{t-1})   (2.71)

where, β_t is the parameter matrix at iteration t, and the gradient \frac{\partial}{\partial β} L_i(β_{t-1}) is calculated based on the ith training sample, which can be randomly selected. The

algorithm is normally designed to sweep through all the training samples several

times until reaching convergence.


Stochastic gradient descent with momentum

In order to improve the gradient descent convergence performance, a mo-

mentum term has been added in gradient descent learning algorithms [76]. The

parameter update strategy of momentum gradient descent can be formulated as

[81],

v_t = γ v_{t-1} + η \frac{\partial}{\partial β} L_i(β_{t-1})   (2.72)

βt = βt−1 − vt (2.73)

where, vt is the calculated descent vector at time t, which is related to its cur-

rent gradient and previous descent steps. With the recursive calculation, all the

previous gradient states have been recorded with the discount rate (fraction) γ.

Thus, the gradient descent becomes a dynamical process with a momentum effect from the previous gradients.

Nesterov accelerated gradient

Based on the momentum gradient descent, a "smarter hill-roll-down ball" has been developed [82], which is called the Nesterov accelerated gradient descent (NAGD). Instead of calculating the gradients of the cost function with respect to the current parameters, the NAGD evaluates the gradient at the estimated future positions of the parameters. The update steps can be formulated as follows:

v_t = γ v_{t-1} + η \frac{\partial}{\partial β} L_i(β_{t-1} − γ v_{t-1})   (2.74)

βt = βt−1 − vt (2.75)

Adaptive moment estimation (Adam)

Adam is one of the state-of-the-art moment based adaptive gradient descent

methods for stochastic optimization [83]. The algorithm can be summarized as


follows,

i. Randomly initialize the parameter matrix β within a small range (−σ,+σ)

as β0, iteration index t = 1, α1 = 0.9, α2 = 0.999, ε = 10−8

ii. Initialize the 1st momentum matrix v0 = 0, and the second momentum

matrix m0 = 0

iii. Update the gradient matrix and momentum matrices as follows,

g_t = \frac{\partial}{\partial β} L_i(β_{t-1})   (2.76)

vt = α1vt−1 + (1− α1)gt (2.77)

mt = α2mt−1 + (1− α2)g2t (2.78)

iv. Calculate the bias-corrected first moment estimate as,

\hat{v}_t = v_t/(1 − α_1^t)   (2.79)

and the bias-corrected second raw moment estimate,

\hat{m}_t = m_t/(1 − α_2^t)   (2.80)

v. Update the parameter as,

β_t = β_{t-1} − \frac{η}{\sqrt{\hat{m}_t} + ε} \hat{v}_t   (2.81)

vi. Update the iteration index t = t+ 1.

vii. Repeat 3 ∼ 6 until reaching the convergence condition.
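A compact sketch of steps i-vii is given below for a generic gradient function; the quadratic objective and the step size η are placeholders used only to illustrate the update rules (2.76)-(2.81).

    import numpy as np

    def adam(grad, beta0, eta=0.01, a1=0.9, a2=0.999, eps=1e-8, iters=500):
        beta = beta0.copy()
        v, m = np.zeros_like(beta0), np.zeros_like(beta0)        # step ii
        for t in range(1, iters + 1):
            g = grad(beta)                                       # gradient, eq. (2.76)
            v = a1 * v + (1 - a1) * g                            # first moment, eq. (2.77)
            m = a2 * m + (1 - a2) * g**2                         # second raw moment, eq. (2.78)
            v_hat = v / (1 - a1**t)                              # bias correction, eq. (2.79)
            m_hat = m / (1 - a2**t)                              # bias correction, eq. (2.80)
            beta = beta - eta * v_hat / (np.sqrt(m_hat) + eps)   # parameter update, eq. (2.81)
        return beta

    # toy example: minimize ||beta - 3||^2, whose gradient is 2*(beta - 3)
    print(adam(lambda b: 2.0 * (b - 3.0), np.zeros(2)))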

Remark 2.6.2. In many cases, Adam could achieve much faster training speed compared to the traditional stochastic gradient descent. However, it is found that even worse solutions could be obtained by Adam in some cases. Many studies have been carried out to address this issue [84, 85]. Several different approaches have been developed based on Adam. However, the performance of the adaptive gradient learning scheme still lacks theoretical support, and it may vary due to the different data properties of the training datasets.

2.7 Randomized Training Scheme for Neural Net-

works Learning

To avoid the slow-convergence and local minima problems of the traditional BP-based neural network training, randomization-based training schemes have been developed as an alternative for neural network learning [86, 87, 88]. In this section, the Monte Carlo based stochastic approximation approach is firstly presented, which suggests that a neural network with a sufficiently large number of hidden nodes is able to achieve the approximation

of any continuous function. After that, one of the most popular randomization-

based methods, extreme learning machine (ELM), is discussed.

2.7.1 Monte Carlo-based Stochastic Approximation

A limit-integral representation of a continuous function f ∈ C(In) can be formu-

lated as

f(x) = \lim_{L \to a} \int_V T[f(w)] G_{w,L}(x) \, dw   (2.82)

where, L is a scalar parameter, a is a real number, V is the domain of the

parameter w, T is an operator, and G is an activation function.


We can then obtain the following integral by approximating the limiting value

with l ≈ a.

f(x) ≈ f_{V,w,l}(x) = \int_V T[f(w)] G_{w,l}(x) \, dw   (2.83)

With the Monte-Carlo method, an estimate of the right side of (2.83) is attained as,

\int_V T[f(w)] G_{w,l}(x) \, dw ≈ \frac{|V|}{N} \sum_{i=1}^{N} T[f(w_i)] G_{w_i,l}(x)   (2.84)

Let β_i = \left( \frac{|V|}{N} \right) T[f(w_i)], i = 1, ..., N; then we have f_{N,w,l}(x) as the functional representation of f,

f(x) ≈ f_{N,w,l}(x) = \sum_{i=1}^{N} β_i G_{w_i,l}(x)   (2.85)

The estimation error of the Monte-Carlo approximation on any compact set

K, K ⊂ In is denoted as

ε^2_{MC}(f_{V,w,l}, f_{N,w,l}) = E \int_K [f_{V,w,l} − f_{N,w,l}]^2 \, dx

= E \int_K \left[ \int_V T[f(w)] G_{w,l}(x) \, dw − \frac{|V|}{N} \sum_{i=1}^{N} T[f(w_i)] G_{w_i,l}(x) \right]^2 dx   (2.86)

As discussed in [9, 89], the approximation error ε_{MC} in (2.86) will converge to zero as N → ∞ in the manner of C/\sqrt{N}. In other words, the representation f_{N,w,l}(x) in (2.85) is a universal approximator when the number of sampled vectors N is large enough.

Remark 2.7.1. As a special case of this method, for any continuous function f ∈ C(I^n), an RVFL representation can be defined as

f_N(x) = \sum_{i=1}^{N} β_i g(w_i x^T + b_i)   (2.87)

where, bi ∈ R and wi ∈ Rn can be randomly generated, g : R → R is a non-

linear activation function. It is proved that the RVFL net is also a universal

approximator, when the hidden nodes number N is large enough [9, 33]. Because

of the stochastic training, the RVFL net is known for its fast learning process.

However, due to the stochastic approximation, the RVFL network tends to require many more hidden units compared to the conventional methods.
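A small sketch of the RVFL representation (2.87) is given below: random input weights and biases are drawn once, and only the coefficients β_i are fitted by least squares. The target function, ranges, and node counts are illustrative; in line with the discussion above, increasing N is expected to reduce the error roughly as C/√N.

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(-1, 1, 400).reshape(-1, 1)         # one-dimensional inputs
    f = np.sin(3 * x) + 0.5 * x**2                     # continuous target function

    for N in (10, 50, 200):
        w = rng.uniform(-2, 2, (1, N))                 # random input weights w_i
        b = rng.uniform(-1, 1, N)                      # random biases b_i
        G = 1.0 / (1.0 + np.exp(-(x @ w + b)))         # g(w_i x + b_i), cf. (2.87)
        beta, *_ = np.linalg.lstsq(G, f, rcond=None)   # only the beta coefficients are fitted
        print(N, np.sqrt(np.mean((G @ beta - f) ** 2)))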

2.7.2 Extreme Learning Machine

The ELM has been developed since 2004 [90, 91]. As shown in Fig. 2.12, the original ELM, proposed by G. Huang [10], was designed as an SLFN. The input weights

and hidden biases of the ELM are randomly generated and fixed within a certain

range. Only the output weights of the network are optimally trained. Known for

its extremely fast training speed, the ELM has been successfully applied in various

fields, including, time series prediction [92, 93], pattern classification [94, 95], and

image processing [96].

The main steps of the ELM algorithm are presented as follows.

i. Randomly generate an input weights matrix w ∈ R^{n×N} and a bias vector b ∈ R^N, which are constrained within a range [−σ, +σ], where N is the hidden nodes number.

ii. Given N training samples (U ,Y ), where U ∈ RN×n, and Y ∈ RN×m,

based on the generated input weights and bias, we can obtain the hidden

state matrix as

H = [h1,h2, ...,hN ]T (2.88)

where,

h_i = g(u_i w + b)   (2.89)

Figure 2.12: The structure of ELM

iii. Obtain the optimal solution of the output weights matrix as,

β = H†Y (2.90)

where, H† is the pseudo-inverse of the hidden state matrix H . Then the

trained neural network becomes,

yi = g(uiw + b)β (2.91)
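The three ELM steps can be sketched as follows, using the Moore-Penrose pseudo-inverse for (2.90). The synthetic dataset, the hidden layer size, the range σ, and the sigmoid activation are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(6)
    n, m, N_hidden, N_samples, sigma = 4, 1, 100, 500, 1.0

    U = rng.normal(size=(N_samples, n))                              # training inputs
    Y = np.tanh(U @ rng.normal(size=(n, m))) + 0.05 * rng.normal(size=(N_samples, m))

    # step i: random input weights and biases within [-sigma, +sigma]
    w = rng.uniform(-sigma, sigma, (n, N_hidden))
    b = rng.uniform(-sigma, sigma, N_hidden)

    # step ii: hidden state matrix H with rows h_i = g(u_i w + b), cf. (2.88)-(2.89)
    H = 1.0 / (1.0 + np.exp(-(U @ w + b)))

    # step iii: output weights via the pseudo-inverse, beta = pinv(H) Y, cf. (2.90)
    beta = np.linalg.pinv(H) @ Y

    Y_hat = H @ beta                                                 # trained network, cf. (2.91)
    print("training RMSE:", np.sqrt(np.mean((Y_hat - Y) ** 2)))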

The theoretical discussions of the modelling capability of the ELM model have

been presented in Theorem 2.5.1. It is noted that an ELM model with N hidden nodes is able to solve any classification problem with N distinct patterns. In many practical applications, the number of training patterns may be very large. To achieve good training performance, the hidden nodes number of the ELM is usually designed to be much larger than that of traditional BP-trained neural networks.

To improve and further investigate the learning capability of the ELM, many

methods have been developed based on the randomization training scheme [33, 97,

98]. Benefiting from the fast training speed, online sequential learning schemes have been developed based on the ELM [99]. The online sequential ELM (OS-ELM) can be trained with chunk-by-chunk data with varying chunk length. Further research on the OS-ELM can be found in [100, 101]. In addition, the training scheme of the ELM has been further studied by incorporating different tech-

niques, e.g., the evolutionary optimization [102, 103], sparse coding [104], and

incremental methods [105, 106]. Furthermore, the ELM learning scheme has

been extended for neural networks with deep architecture [107, 108, 109]. The

modelling capability of the ELM has been significantly improved.

2.8 Regularized Optimization Technique

The optimization techniques have been widely employed in various applications,

including mechanical system design [110, 111], water distribution networks opti-

mization [112], and satellite constellation system modelling [113]. For the mathe-

matical modelling of a real-world system, the optimization process mainly consists

of two parts: construction of objectives and constraints, and searching for optimal

solutions [114]. In many data-driven based modelling cases, the optimization of

the model is always associated with regularization to obtain robust and accurate

solutions for the modelling [115].

Similarly, during the neural network training, the ANN-based models aim to learn the system mechanisms by approximating the input-output rela-

tionships using the training data. However, the data collected from the real world

systems always contain noises, disturbances and uncertain components. Because

of that, the purely mathematical approximation schemes may lead to poor train-

ing results. To avoid overfitting and obtain a generalized solution, there are

several different approaches for neural network training, e.g., dropout [116], data

augmentation [117], and early stopping [118]. Those regularization techniques

are designed from different perspectives: model connectivity, data features, and

training process, for specific neural network learning tasks. In this section, a gen-

eral regularization problem is considered. The norm function based regularization

techniques for neural network learning are discussed.

2.8.1 Regularization Problem and Optimal Conditions

The classical regularization theory can be traced back to the 1960s when the math-

ematical basis of the regularization was developed by Tikhonov [119]. The key

idea of the classical regularization is to minimize the cost function as well as to

reduce the functional complexity [75, 120]. In this section, the formulation of

the regularized neural network training problem is firstly presented. Then, the

optimal conditions for the constrained optimization problem are discussed.

Given N training samples (U ,T ), where, U = [u1,u2, ...,uN ], ui ∈ Rn, is

a set of training input, and T = [t1, t2, ..., tN ], ti ∈ Rm, is the corresponding

training target, we can construct a regularized minimization problem for the

training of a general neural network as,

\min \frac{1}{2} \left\| \mathcal{N}(U; α, b^{(1)}; β, b^{(2)}) − T \right\|^2 + λ \mathcal{F}(α, b^{(1)}; β, b^{(2)})   (2.92)

where, N (U ;α, b(1);β, b(2)) represents the output of the neural network with

input U, the objective function is designed as the L2 norm of the distance between the model output and the training target, λ is the regularization factor, and F(α, b^{(1)}; β, b^{(2)}) is a function which is designed to regularize the neural network

parameters {α, b(1);β, b(2)}.

The regularized neural network training objective can also be formulated in

the form of a constrained optimization problem as,

\min \frac{1}{2} \left\| \mathcal{N}(U; α, b^{(1)}; β, b^{(2)}) − T \right\|^2   (2.93)

s.t. F(α, b(1);β, b(2)) ≤ σ (2.94)

where, σ is a small positive number. It is noted that the constrained problem may have a different solution from the problem formulated in the Lagrange form in (2.92), although both of them are designed with similar motivations.

Since the above-mentioned optimization problem for the neural network training is normally nonlinear, it is hard to obtain the optimal solution [121]. In this section, the necessary conditions for the optima, called the Karush-Kuhn-Tucker (KKT) conditions, are introduced [122].

As a simplified and generalized example, the following optimization problem is constructed with m equality constraints and n inequality constraints.

β = argminβV(β) (2.95)

s.t. hi(β) = 0, ∀ i = 1, ...,m (2.96)

s.t. gj(β) ≤ 0, ∀ j = 1, ..., n (2.97)

where, the objective function V : RN → R, the constraint functions hi : RN → R

and gj : RN → R are all assumed to be continuously differentiable at the optimal

point β.

For the above optimization problem, we can have the following Lagrangian with KKT multipliers,

L(β, λ, u) = V(β) + \sum_{i=1}^{m} λ_i h_i(β) + \sum_{j=1}^{n} u_j g_j(β)   (2.98)

where, λ_i is the equality constraint factor, and u_j is the inequality constraint factor. Then, we can have the following conditions at the optimal point β.

Stationarity

∇_β V(β) + \sum_{i=1}^{m} λ_i ∇_β h_i(β) + \sum_{j=1}^{n} u_j ∇_β g_j(β) = 0   (2.99)

Primal feasibility

hi(β) = 0, ∀ i = 1, ...,m (2.100)

gj(β) ≤ 0, ∀ j = 1, ..., n (2.101)

Dual feasibility

uj ≥ 0, ∀ j = 1, ..., n (2.102)

Complementary slackness

ujgj(β) = 0, ∀ j = 1, ..., n (2.103)

In special cases, when there is no inequality constraint, the above-mentioned

conditions become the Lagrange conditions. Then, the problem can be solved by

Lagrange multipliers. The effects of the inequality constraints are illustrated in

Fig. 2.13. It is noted that, when the optimum of the objective function is outside

of the inequality constraint region, the optimal solution is on the boundary of

the constraint region. In this case, the inequality constraint is equivalent to the corresponding equality constraint.

Figure 2.13: Graphical explanation for the inequality constraints: (a) the optimal point is within the constrained region; (b) the optimal point is on the constraint boundary.

The KKT conditions are widely applied to solve and analyse the nonlinear

optimization problems [123]. It is noted that the conditions are the first order

necessary conditions for optimality, and become sufficient when the optimization

problem is convex for minimization or concave for maximization. However, in

many nonlinear optimization problems, the equations and inequalities usually cannot be solved directly; thus, many optimization techniques have been developed to obtain the KKT solution.

2.8.2 Tikhonov Regularization

In practice, due to the noise and disturbances in the data, the ordinary least squares estimator often performs poorly [27]. The unknown input-output re-

lationship can hardly be uniquely reconstructed from the noisy training sam-

ples which contain much irrelevant information. In neural network learning, a


well-trained network function is able to retrieve output patterns with the corre-

sponding input patterns. It is noted that the construction of such mathematical

mapping is equivalent to the design of a hypersurface which links the input and

output uniquely and continuously.

Tikhonov regularization has been developed to serve as a stabilizer that smooths the cost function, which can also be considered as a model complexity penalty [124, 125]. For the problem in (2.92), the Tikhonov regularized optimization problem can be formulated as,

\min \frac{1}{2} \left\| \mathcal{N}(U; α, b^{(1)}; β, b^{(2)}) − T \right\|^2 + \frac{λ}{2} \left\| D(\mathcal{N}(U)) \right\|^2   (2.104)

where, D(·) is a differential operator, λ is the regularization parameter. It is

noted that the regularization term is designed in the form of L2 norm. As shown

in (2.104), the updated optimization problem aims at minimizing the distance

function between the model outputs and training targets, as well as constraining

the derivatives of the network function with respect to the input. As a result, a

much smoother hyper-surface of the network function can be constructed, which

could result in a much more stable and robust solution, compared to the solution

without regularization.
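For the common special case in which the penalty is placed directly on the output weights of a linear readout, rather than on a general differential operator D, the regularized problem admits the closed-form ridge solution sketched below. This is a simplification for illustration, not the general operator form of (2.104); the data and λ are placeholders.

    import numpy as np

    def ridge_readout(H, T, lam=1e-2):
        # Tikhonov-regularized least squares: min ||H beta - T||^2 + lam*||beta||^2
        return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ T)

    rng = np.random.default_rng(7)
    H = rng.normal(size=(300, 50))                          # feature matrix (e.g. hidden states)
    T = H @ rng.normal(size=(50, 1)) + 0.1 * rng.normal(size=(300, 1))   # noisy targets
    beta = ridge_readout(H, T)
    print(np.linalg.norm(beta))                             # larger lam gives a smaller-norm beta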

Remark 2.8.1. The ill-posed optimization problem regularly occurs in practical

engineering. In supervised learning with a large training dataset which contains a lot of noise and unrelated system information, the Hadamard conditions for constructing a well-posed mathematical mapping could be violated [27, 126]. The

L2 norm of the neural network function differential term in (2.104) can be used

to represent the complexity of the model. By constraining the derivative of the

function with respect to the input, the smoothness of the network function can

be improved. Similar outputs could be generated with similar inputs. Thus, the

generality of the regularized solution could be potentially improved.


For the ill-posed optimization problems, compared to the optimal solution

without regularization, a better-conditioned optimum could be obtained by adding

the Tikhonov regularization term. It is noted that, for neural network training,

similar results could also be obtained by adding noise to the input training data

[127]. In addition, the over-smoothing problem may occur, even though the reg-

ularization parameters are well-determined [128]. A generalized regularization

theory has been developed for semi-supervised learning, which is called manifold

regularization [129]. An improved Tikhonov regularization method was developed

in [130], by re-designing the regularization matrix with the help of truncated sin-

gular value decomposition. It shows significant improvements in many ill-posed

optimization problems. Further work on different construction methods for the regularizer terms is still worthwhile to investigate.

2.8.3 L1 Regularization and LASSO

Similar to (2.104), by replacing the L2 regularization term by the L1 penalty, we

can have the following regularization problem [131],

\min \frac{1}{N} \left\| \mathcal{N}(U; α, b^{(1)}; β, b^{(2)}) − T \right\|^2 + λ \left\| D(\mathcal{N}(U)) \right\|_1   (2.105)

which can also be formulated in the form of constrained optimization problem as,

\min \frac{1}{N} \left\| \mathcal{N}(U; α, b^{(1)}; β, b^{(2)}) − T \right\|^2   (2.106)

s.t. \left\| D(\mathcal{N}(U)) \right\|_1 ≤ t   (2.107)

The above optimization scheme was originally developed as the least absolute shrinkage and selection operator (LASSO) [132]. It is noted that the L1 norm regularization often generates robust solutions with sparse function parameters. In [133], with L1 minimization, a sparse representation based classification approach was developed for face recognition.

Figure 2.14: Different characteristics of the norm functions for the one-dimensional input case: (a) the L1 norm loss function L1(A) = |A|; (b) the L2 norm loss function L2(A) = (1/2)|A|^2; (c) the derivative function of the L1 norm; (d) the L2 norm derivative function dL2(A)/dA = A.

The L1 regularization has

been applied for signal recovery from inaccurate and incomplete system observations [134]. Furthermore, many applications of L1 regularization can be found in various fields, including signal and image reconstruction [135, 136], compressive

signal sampling [137], and signal identification [138].

Remark 2.8.2. As shown in Fig. 2.14, the L1 and L2 norm functions present

different geometric features. The derivative of the L1 is constantly +1 for positive

input, and −1 for negative input. In contrast, the derivative function of the L2

norm is an increasing function. Taking the one dimensional input case in Fig.

2.14 as an example, we can have the gradient descent based optima searching

process for L1 norm function as

57

Page 76: Neural network-based dynamical modelling for system

A = A − η \frac{d}{dA} L_1(A) = \begin{cases} A − η, & A > 0 \\ A + η, & A < 0 \end{cases}   (2.108)

where, η is the searching step. During the above gradient descent process, the input

parameter A is reduced towards zero at the rate η. Thus, the L1 regularization could shrink the parameters towards zero and generate sparse solutions.

For L2 norm, the gradient descent process can be written as,

A = A − η \frac{d}{dA} L_2(A) = A − ηA   (2.109)

The slope of the L2 norm function decreases to zero when the input A is close to zero. Thus, the L2 norm regularization does have a weight decay effect, but cannot generate a sparse solution. However, due to its smooth derivative, the L2 regularization is much more computationally efficient, compared to the L1 regularized approaches.
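The shrinkage behaviour in (2.108) underlies the soft-thresholding operator used by proximal (ISTA-type) solvers for LASSO problems. A minimal sketch, with the L1 penalty applied directly to the parameter vector, is given below; the step size, λ, and the synthetic data are illustrative.

    import numpy as np

    def soft_threshold(z, kappa):
        # proximal operator of kappa*||.||_1: shrinks every entry towards zero
        return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

    def ista(H, t, lam=0.1, iters=500):
        # minimize (1/2)||H beta - t||^2 + lam*||beta||_1 by proximal gradient steps
        eta = 1.0 / np.linalg.norm(H, 2) ** 2          # step size below the Lipschitz bound
        beta = np.zeros(H.shape[1])
        for _ in range(iters):
            grad = H.T @ (H @ beta - t)
            beta = soft_threshold(beta - eta * grad, eta * lam)
        return beta

    rng = np.random.default_rng(8)
    H = rng.normal(size=(100, 20))
    true_beta = np.zeros(20)
    true_beta[[2, 7]] = [1.5, -2.0]                    # sparse ground truth
    beta = ista(H, H @ true_beta + 0.01 * rng.normal(size=100))
    print(np.nonzero(np.abs(beta) > 0.05)[0])          # mostly recovers indices 2 and 7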

It is noted that, compared to the Tikhonov regularization, it is more difficult to obtain the optimal solution for the L1 regularization problem, due to the discontinuous derivative of the penalty. Much research has been carried out [139] to address this problem. In [140], a new solution for the LASSO is developed, which is called the alternating direction method of multipliers. In addition to the L1 and L2 regularization, many other norm functions have also been applied in optimization techniques. L0 norm regularization has been applied in deep encoders to generate sparse solutions. Furthermore, combined norm functions have also been developed for joint regularization, e.g. the L0,∞ norm [141] and the L2,1 norm [142].


2.9 Turbine Engine System Modelling and Prog-

nostics

The turbine engine has been widely employed as the power source in aircraft,

ships, and trains. To improve the reliability and availability of the turbine engine, a lot of research attention has been devoted to engine system modelling, health condition monitoring, and turbine engine diagnosis and prognostics [143, 144].

The performance degradation process modelling of the turbine engine could be

considered as a typical example of dynamical system modelling problems. In this

section the turbine engine system modelling is firstly introduced. Then, some

existing techniques for the turbine engine system prognostics are reviewed.

2.9.1 Turbine Engine System Modelling

As a complex mechatronic system, a turbine engine mainly consists of a compressor, which compresses the incoming air to high pressure; a combustor, also known as the combustion chamber, where the mixed fuel and compressed air are burned; and a turbine, which extracts energy from the high-pressure combusting gas. For different types of turbine engines, different additional com-

ponents are constructed to increase engine efficiency. For example, in a turbofan

engine, a ducted fan is normally designed to accelerate the airflow and generate

the thrust [145].

All these components of the turbine operate under sophisticated mechanisms,

which makes it very challenging to perform the mathematical modelling for the

whole turbine engine. Currently, there are mainly two major topics for the turbine

engine system modelling: engine performance modelling, and engine deterioration

modelling, which are discussed below.


Engine Performance Modelling

The engine performance modelling plays an important role for us to understand

the engine operation mechanism, and provide information for engine production

and maintenance. Furthermore, since the data sources of the aircraft engine

are limited and controlled by turbine engine manufacturers, a well-constructed

model thus becomes necessary to generate high-quality engine data and develop

the data-driven based approaches for engine system diagnosis and prognosis [146].

During the last half-century, a lot of research has been carried out for gas

path analysis [147], and the engine performance modelling [148, 149]. Based on

the thermodynamic principles, the theory of gas turbine has been founded and

developed from the perspective of turbine design and off-design. However, the

theoretical models are based on simplifications, which cannot completely describe

the complex mechanism of the whole turbine engine. Furthermore, as addressed

in [144], another major problem is the lack of component maps, which means

that it is very hard to obtain accurate parameters and states of engine compo-

nents, e.g. pressure ratio, rotational speed, and isentropic efficiency, by using

the theoretical models. In light of that, further research has been conducted. In

[150], a system identification based method for gas turbine component maps is

developed. With the help of system identification techniques and experimental

engine data, the component map modelling scheme is able to achieve much smaller modelling errors, compared to the traditional method.

Furthermore, in [151], the performance simulation of an aeroderivative gas turbine is conducted in Matlab Simulink. Based on that, further research has

been conducted for the component map modelling [152, 153], and performance

adaptation modelling for the gas turbine. It is noted that the concept of adapta-

tion is very useful in engine performance modelling [154]. By applying the opti-

mization techniques, with limited data sources, the adaptation modelling method

is able to identify the relationships between different system measurements, and thus obtain a physically related mathematical model for turbine maintenance and prognostics.

Engine Deterioration Modelling

In addition to the engine performance modelling, the deterioration and damage

propagation modelling of the turbine engine has also attracted a lot of attention.

Different from the engine performance evaluation and monitoring, deterioration

modelling aims at capturing and evaluating the development trend of the turbine deterioration. Com-

pared to the static performance evaluation, the dynamic process modelling is

much more challenging.

The modelling of the effects of different types of compressor degradation, e.g. fouling and erosion, was investigated in [155]. With the help of engine field observations, models for the two most common causes of compressor deterioration were developed and validated. In [156], the performance deterioration modelling

and prediction are carried out for aircraft gas turbine engines. The deterioration

modelling was mainly focused on the compressor and turbine. A so-called stage

stacking technique was applied for the compressor modelling. A tip rub model is

employed for the turbine deterioration process.

Figure 2.15: Structure of a simplified turbofan engine model


Furthermore, a damage propagation model is developed for the run-to-failure process of the aircraft engine, which was conducted by NASA [157]. As shown in

Fig. 2.15 [158], the developed system model was mainly focused on five differ-

ent rotating components, which are, fan, low-pressure compressor (LPC), high-

pressure compressor (HPC), high-pressure turbine (HPT), and low-pressure tur-

bine (LPT). The outputs of the model are twenty-one different simulated sensory measurements. The generated dataset is also used as a major study case in this thesis.

2.9.2 Turbine Engine System Prognostics

In addition to the turbine system modelling, the engine fault diagnosis and prog-

nostics have also attracted a lot of attention from the engineers and researchers.

The system diagnosis aims at identifying the system faults by detected sensory

information. In contrast, the system prognostics is focused on capturing the system developing features and predicting the future system states. According to the

literature [143, 144], the current system prognostic techniques for turbine engine

can be divided into two categories: model-based and data-driven methods. The

model-based schemes are mainly based on the modelling techniques as mentioned

in the above section. To avoid repetition, this section mainly focuses on the

review of the data-driven prognostic methods.

One type of system prognostic approach is based on statistical methods. In [159], a combined linear and quadratic regression model is developed for the sta-

tistical prognostic analysis of the engine degradation. As addressed in the paper,

with different failure rate developing states, different models may be selected for

the prognostics. For example, when the failure rate is changing, the linear re-

gression model cannot handle the non-linearity. However, the linear model could

perform effectively at the stage where the failure rate is constant. Furthermore, much research has focused on the predictions and prognostics of some basic engine elements by using different statistical approaches, e.g. kurtosis analysis [160],

autoregressive moving average (ARMA) model [161, 162], and Monte Carlo sta-

tistical method [163].

Another main scheme of turbine engine prognostic is based on machine learn-

ing and neural network techniques. With the rapid development of the machine

learning techniques, the ANNs based technique has been successfully applied for

the turbine engine prognostics. Different dynamic neural network (DNN) based

schemes have been developed for the engine fault detection, diagnosis and prog-

nostics [164, 165]. With the recorded engine sensory measurements, the developed

neural network based dynamical modelling scheme has shown excellent perfor-

mance on the turbine fault diagnosis and prognostics. Furthermore, to enhance the modelling capability, the DNN based ensemble scheme has also been applied for the turbine prognostics, which showed a significant improvement [166].

2.10 Conclusions

In this chapter, the dynamical neural network modelling, neural network learning

technique and its training approaches have been comprehensively reviewed and

discussed. The basics of dynamical system modelling and the SLFNs structure

have been first introduced. Then, the existing recurrent structures of the neural

network have been presented, which have been designed to enhance the modelling

capability and time-related information storage capacity, followed by the intro-

duction of a special framework of RNNs which is called ESNs. After that, the

multi-hidden layered neural networks are described with some discussions on the

learning capability of the SLFNs and TLFNs. A brief overview of deep neural

networks is also presented. In addition, the training process of the neural network

has been formulated with discussions of the BP algorithm and several gradient


descent methods. Furthermore, a randomization based neural network learning

scheme has been elaborated, from the Monte-Carlo stochastic approximation to

the ELM theory. Some new works of the ELM technique are also included. To

address the over-fitting problems and enhance the neural network training, the

norm function based regularization techniques have also been introduced and dis-

cussed. Finally, as one of the major study cases of this thesis, the turbine engine

system modelling and prognostics techniques have been reviewed.


Chapter 3

A NARX Network-based

Approach for Dynamical System

Modelling with Application to

Turbofan Engine Prognostics

In this chapter, a NARX recurrent neural network model with output self-feedback

structure is developed for the dynamical system modelling problem with sequence

output. In contrast to the existing learning strategies of NARX networks, a much

faster optimization algorithm is developed based on the Monte-Carlo stochastic

training and the regularized batch learning type of least squares. The proposed

NARX structured neural network (NARX-DNN) is applied to a turbofan engine

prognostic problem. The model validations are carried out by using the NASA

turbofan engine dataset, with comparisons to the ELM, DNN with output self-

feedback, and the traditional NARX network. The results show that, with smaller

hidden nodes number and cheaper computational cost, the proposed NARX-DNN

achieves superior prediction accuracy and exhibits stronger robustness.


3.1 Introduction

Dynamical system modelling aims at performing the system identification and

control, based on system measurements and observations [21]. Specifically, in

complex mechatronic systems, a number of sensors are employed to collect the

information for system monitoring and improving the operating reliability [167,

168, 169]. However, due to the complexity, the variation of operating conditions

and system uncertainties, the recorded sensory data is inherently noisy. To over-

come system disturbance and noise, sequence data processing and filtering ap-

proaches are normally employed [161, 170]. Thus, in this chapter, the dynamical

system problem with sequential input and output is investigated.

ANNs have attracted extensive attention from various research fields in recent

years. Due to the strong capability in dealing with non-linearity, ANNs have been

successfully employed in image processing, dynamic system modelling [21, 29, 52,

171], and system fault diagnosis [172, 173]. Among the traditional neural network

modelling methods, the BP algorithm has been widely used [27, 34]. However,

time-consuming training and local minima problems have been recognized as the

main issues of BP training.

In contrast, randomization based learning scheme has been developed for

ANNs [88], e.g. random vector functional-link (RVFL) network [9, 174], and

ELM [10, 90], which are known for fast training speed. For the single-hidden

layer feedforward neural networks (SLFNs), the input weights are randomly con-

strained, and the output weights are optimized by using the conjugate gradient

approach or a single step learning of pseudo-inverse. Such stochastic training

scheme can obtain very fast batch training process and good performance in

various applications [87, 91, 175]. Based on the randomly generated invertible

hidden feature matrix, the training of ELM becomes a linear optimization prob-

66

Page 85: Neural network-based dynamical modelling for system

lem, which can be achieved by the batch learning type of least square. The local

minima and slow convergence training problems are avoided. However, according

to the Monte-Carlo based stochastic training principle [89, 176], the inputs of

the ELM are normally mapped to a high dimension space, which leads to a large

matrix inverse operation in the optimization process. As a result, the fast train-

ing in ELM always comes with expensive computational costs (e.g., high CPU

usage) during simulations. Further research has been carried out to improve the

structural efficiency of the stochastic training strategy [33, 177, 178].

On the other hand, to detect time behaviours from the sequence data, recur-

rent neural networks (RNNs), e.g. Hopfield neural networks [39], Jordan networks

[40], and Elman networks [41], have been developed in parallel with SLFNs since

the 1980s. The recursive feedback structures are designed to provide the neural

networks with dynamic memory and to enhance the non-linear dynamical sys-

tem modelling capability [179]. Those models are normally trained by the back-

propagation through time (BPTT), the slow convergence and gradient vanishing

problems are found as main issues [180]. In addition, a NARX neural network was

investigated in [12, 43] for the dynamical system modelling. It was shown that the

NARX network can be trained as static networks using the series-parallel model

in [21] to avoid the gradient vanishing problem. However, the time-consuming

training is still a big issue for practical applications of the RNNs.

Recently, a new framework of DNN with output self-feedback has been pro-

posed in [11, 47] for crack growth propagation modelling. Similar to the method-

ology of ELM training, hidden activations are randomly generated from input

signals. Both the output weights and feedback weights are optimized by the reg-

ularized batch learning type of least squares. Excellent performance has been

demonstrated with the application to the crack growth modelling in ductile al-

loys. Based on that, a DNN with different feedback loops has been proposed in

[181]. However, the non-linear function of the DNN was linearized for the fast

67

Page 86: Neural network-based dynamical modelling for system

training purpose, which might affect its performance on the non-linear dynamical

modelling.

In this chapter, a NARX-DNN model is developed for the dynamical system

modelling problem. Two different feedback loops are designed to express different

system dynamics. The main work of this chapter is listed as follows:

i. A Monte Carlo approximation based stochastic training scheme is devel-

oped for the NARX type neural network.

ii. Compared to the conventional BP trained NARX network, a much faster

training process is developed for the NARX-DNN. The learning process of

the DNN-NARX is completed in three batched steps without recursively

parameters tuning, the local minima and show-convergence problem are

avoided.

iii. Compared to the feed-forward neural network trained by similar stochas-

tic learning scheme, the robustness of the stochastic training approach is

significantly improved by designing output feedback loops.

iv. The proposed NARX-DNN modelling is successfully applied for the turbine

system modelling and prognosis with excellent performance.

The rest of the chapter is organized as follows: in Section 3.2, the non-linear

dynamical system modelling problems with sequence output and the training

problems are introduced. In Section 3.3, the proposed NARX-DNN model is

formulated, with a simplified DNN for non-linear feedback weights estimation. In

Section 3.4, the optimization process of the proposed NARX-DNN is presented.

In Section 3.5, using the NASA turbofan engine datasets, the simulations and

comparisons of different neural networks are carried out. Section 3.6 concludes

the chapter and provides the direction for future work.

68

Page 87: Neural network-based dynamical modelling for system

3.2 Problem Formulation

To represent and approximate non-linear system dynamics, neural network based

modelling approaches have been developed. In this section, the neural network

based non-linear dynamical modelling problems with sequence output are dis-

cussed. Several neural network models with different feedback structures are pre-

sented, followed by the discussions of training problems of the dynamical neural

models.

3.2.1 Dynamical Neural Modelling with Sequence Output

As discussed in Section 2.1.4, several different neural network models have been

successfully developed and applied for dynamical system modelling. However,

most of the related research is about dynamical neural modelling with single out-

put. In real-world applications, the signal is always noisy. A single measurement

with noise cannot represent the system state. Instead, a sequence of measurement

is able to show rich dynamical information. Thus, in this section, the dynamical

system modelling problems with sequence output are introduced.

Assume that ut = [ut, ..., ut−n+1]T is the recorded system input sequence at

time t, where ut−i, i ∈ 0, 1, ..., n− 1, is the recorded past system input at time

t − i, n is the input length, yt = [yt+1, yt+2, ..., yt+m]T is the output sequence at

time t, where yt+k, k ∈ 1, 2, ...,m, is the future system state to be predicted, m is

the length of prediction sequence which is pre-setted according to the modelling

task, ybt = [yt, yt−1, ..., yt−l+1]T is the output feedback sequence at time t, where

yt−j, j ∈ 0, 1, ..., l − 1, is the recorded past system state, the feedback length l is

normally selected as l ≥ m, then, we can have some generic forms of dynamical

neural modelling with sequence output formulated as follows,

69

Page 88: Neural network-based dynamical modelling for system

Model I:

yt = wTb ybt +N1

(ut, ..., ut−n+1

)(3.1)

Model II:

yt = N2

(yt, ..., yt−l+1

)+wTut (3.2)

Model III:

yt = N3

(yt, ..., yt−l+1;ut, ..., ut−n+1

)(3.3)

where, N1 : Rn → Rm,N2 : Rm → Rm,N3 : Rl+n → Rm, and N4 : Rk → R are

non-linear neural network functions, w ∈ Rn×m and wb ∈ Rl×m are parameter

matrices. In Model I, the output vector is assumed to be linearly related to

the past values and nonlinearly affected by the inputs. In Model II, the system

output state evolution is designed as a non-linear process with linear effects from

the inputs. The Model III is in the forms of the NARX networks.

Remark 3.2.1. The above-mentioned four neural models are all designed to

deal with the dynamical relationship between the system future output sequence

yt = [yt+1, yt+2, ..., yt+m]T and the system current input time sequence ut =

[ut, ..., ut−n+1]T , as well as the system feedback yt = [yt, ..., yt−l+1]

T . The major

differences of those models are the input and feedback network structures. The

output dynamical dependences on input sequence are expressed by non-linear neu-

ral network functions in Model I and III, and by a simple linear model in Model II.

On the other hand, the system output feedback is structured with a linear autore-

gressive scheme in Model I, and is designed in a neural network manner in Model

II, III. It is noted that the design of the network structure might vary for different

problems. In this chapter, the impacts of different network feedback structures on

the performance of dynamical system modelling are further investigated.

70

Page 89: Neural network-based dynamical modelling for system

3.2.2 Training Approaches for Dynamical Neural Models

In addition to the structures of the dynamical neural models, the network training

approach is also very important. As mention in Section 2.1.3, the dynamical

system modelling and identification can be divided into two categories: parallel

model and series-parallel model.

The parallel model is designed to learn system information from the system

inputs, outputs and the model output feedback data. In contrast, the series-

parallel model is designed with feedback directly from recorded plant output.

Taking Model III in (3.3) as an example, the parallel neural model can be written

as,

yt = N3

(yt, ..., yt−l+1;ut, ..., ut−n+1

)(3.4)

where, yt = [yt+1, ..., yt+m]T is the output vector of the model at time t, yt, ..., yt−l+1

is the feedback sequence from the model output. In contrast, the corresponding

series-parallel model is written as,

yt = N3

(yt, ..., yt−l+1, ut, ..., ut−n+1

)(3.5)

It is noted that the parallel model in (3.4) is structured with the feedback from

its own outputs. Since the feedback sequences are changed after updating the

model parameters, the dynamic optimization methods are required for training

such models. In many practical problems with noisy training data, the slow

convergence of the parallel model training is a big concern. Differently, static

training approached can be applied to train the series-parallel model, which could

lead to a much faster modelling process.

Remark 3.2.2. The training approach like BP is designed for static models,

where the output can be directly generated from current network input, it is

71

Page 90: Neural network-based dynamical modelling for system

straightforward to calculate the output error and apply the static training method.

However, for the dynamic models, the model output is also dependent on the

model output feedback or network hidden states, which are dynamically changing

with the updating parameters during the training. Thus, dynamic training method

like BPTT is normally necessary for the dynamic models. However, the dynamic

training process is much more time-consuming, compared to the static training. In

some cases with noisy training data, it is very hard to reach the optimal solution

by the conventional dynamic training approach, which motivates us to develop an

efficient dynamical modelling scheme.

3.3 The NARX-DNN

In this section, a new NARX-DNN model is proposed for non-linear dynamical

system modelling. The inputs, outputs, and feedback are all constructed in the

form of time sequences, shown as tapped-delay-lines in Fig. 3.1. The dashed

connections in outputs indicate the time relationship of the output samples during

the training process. After training, the links in the output layer are eliminated,

the outputs are calculated independently from hidden activations and output

feedback.

3.3.1 Structure of The NARX-DNN with Multi-sequences

Input and Sequence Output

Considering the modelling problems with multi-sequences input, we can construct

the input vector of the model at time t, by cascading all the input sequences as

ut = [u1t,u2t, ...,upt]T , where the ith input sequence can be written as uit =

[uit ui(t−1) ... ui(t−d+1)]T , p is the number of the input sequences, and d is delay

72

Page 91: Neural network-based dynamical modelling for system

Figure 3.1: The network structure of NARX-DNN

number. The input weights matrix can be written asw = [w1,w2, ...,wN ], where,

wi = [w1i, w2i, ..., wni]T , i ∈ 1, 2, ..., N , and the bias vector is b = [b1, b2, ..., bN ]T ,

N is the hidden nodes number, and n = p · d is the input nodes number.

By defining non-linear feedback weights matrix as wb = [wb1 ,wb2 , ...,wbN ],

with wbj = [w(n+1)j, w(n+2)j, ..., w(n+l)j]T , l is the number of feedback nodes, the

output of the ith hidden node at time t can be obtained as follows:

hit = h(wTi ·ut +wT

bi·ybt + bi) (3.6)

where,ybt = [yt, yt−1, ..., yt−l+1]T is the feedback sequence at time t.

As shown in Fig. 3.1, the network is constructed with N hidden nodes, the

output vector of the hidden layer at time t can be written as

73

Page 92: Neural network-based dynamical modelling for system

ht = h(wT ·ut +wTb ·ybt + b) (3.7)

For the purposes of time series predictions and system prognostics, the final

output sequence of the network is organized as,

yt = [yt+1, yt+2, ..., yt+m]T (3.8)

where, m is the output nodes number, yt+k is the kth output at time t, k ∈

1, 2, ...,m.

The output weight matrix of the neural network can be formulated as β =

[β1,β2, ...,βm], where βk = [β1k, β2k, ..., βNk]T , and βb = [βb1 ,βb2 , ...,βbm ] is the

output self-feedback weights matrix, where βbj = [β(N+1)j, β(N+2)j, ..., β(N+l)j]T ,

we can have the proposed NARX-DNN model formulated as follow,

yt = βT h(wTut +wTb ybt + b) + βTb ybt (3.9)

It is noted that there are two types of output feedback involved in the model

shown in (3.9). During the training process, the model is induced to learn different

feedback dynamic features through different feedback loops. Similar to the NARX

networks, the non-linear hidden layer feedback loops generate non-linear impacts

on the outputs. The self-feedback loops are used to express output evolution

process.

3.3.2 A Simplified NARX-DNN for Training

In this section, a simplified dynamic neural model is established for the evaluation

of non-linear feedback weights. The non-linear activation function employed in

the neural model is an essential operating element in non-linear system modelling.

74

Page 93: Neural network-based dynamical modelling for system

In the developed model, the hyperbolic tangent function is used,

h(x) = (eu − e−u)/(eu + e−u) (3.10)

which operates approximately linearly with a slope of one when u ∈ (−θ, θ), θ

is a small positive number, and saturates to −1 and 1 when x tends to infin-

ity. In order to improve the learning speed [182], the hidden nodes are always

constrained to work in the nearly-linear region of the non-linear function. By

properly choosing the randomized range of w, wb, and b, the following condition

can be obtained,

− θ < wTi ut +wT

biybt + bi < θ (3.11)

which constrains nonlinear nodes to operate in an approximately linear region.

Thus, the output of the ith hidden node in (3.6) can be simplified to a linear

expression,

hit = wTi ut +wT

biybt + bi (3.12)

Then, the kth output of the simplified DNN at time t can be expressed as,

yt+k =N∑i=1

βikhit +l∑

j=1

β(N+j)kyt−j+1

=N∑i=1

βik

(wTi ut +

l∑j=1

w(N+j)iyt−j+1 + bi

)

+l∑

j=1

β(n+j)kyt−j+1 (3.13)

75

Page 94: Neural network-based dynamical modelling for system

which can be written as

yt+k =β1k

(w1ut +

l∑j=1

w(n+j)1yt−j+1 + b1

)

+β2k

(w2ut +

l∑j=1

w(n+j)2yt−j+1 + b2

)+ ...

+βNk

(wNut +

l∑j=1

w(n+j)Nyt−j+1 + bN

)+

l∑j=1

β(N+j)kyt−j+1 (3.14)

Then, we can have

yt+k =N∑i=1

βik(wTNut + bi) +

l∑j=1

β(N+j)kyt−j+1 +l∑

j=1

αjkyt−j+1 (3.15)

where,

α1k

α2k

...

αn2k

=

w(n+1)1 w(n+1)2 . . . w(n+1)N

w(n+1)1 w(n+1)2 . . . w(n+1)N

......

. . ....

w(n+1)1 w(n+1)2 . . . w(n+1)N

β1k

β2k...

βNk

(3.16)

Remark 3.3.1. From (3.12) to (3.16), the output of the network is reformulated

under the condition of (3.11). It is noted that the kth output at time t consists

of three components,

yt+k = y(i)t+k + y

(ii)t+k + y

(iii)t+k (3.17)

where, y(i)t+k =

∑Ni=1 βik(w

Ti ut + bi) is projected from the current sensory data

sequences, y(ii)t+k =

∑lj=1 β(N+j)kyt−j+1 represents the output autoregressive part

of the model. y(iii)t+k =

∑lj=1 αjkyt−j+1 is the simplified non-linear feedback part

which is used to express the system non-linear propagation mechanism. It is

noted that the system’s evolution is herein described as a collective result of both

the autoregressive and non-linear processes.

76

Page 95: Neural network-based dynamical modelling for system

The hybrid output feedback weights αk = [α1k α2k ... αlk]T in (3.16), can also

be formulated as

αk = wTb βk (3.18)

Then, the hybrid output feedback weights matrix can be obtained as α =

[α1,α2, ...,αm]. The input generated part of hidden output vector can be written

as hin

t = [wT1ut + b1,w

T2ut + b2, ...,w

TNut + bN ]T . Defining the output vector of

the simplified model as yt = [yt1 ... yt1 ... ytm]T , based on (3.15) to (3.18), we can

have

yt = βT hin

t + βTb ybt +αTybt (3.19)

or

yt = CTH t (3.20)

where, the hybrid weights matrix can be defined as C = [β, βb, α]T . H t =

[hin

t ,ybt ,ybt ]T is the combined training data vector at time t. Assuming that we

have N training input samplesU = [uτ , uτ+1, ..., uτ+N−1], where τ is the starting

point for training, and the corresponding output of the simplified model Y =

[yτ , yτ+1, ..., yτ+N−1] , the batch type of the simplified model can be formulated

as:

Y t = CTH (3.21)

where the combined training data matrix H is shown in (3.22).

Remark 3.3.2. The simplified model is now reformulated as a linear form in

(3.21) where the training problem has been transformed to the optimization of

the hybrid weight matrix C. However, the accuracy of the model is based on the

condition of (3.11) which might not be met in some cases. Similar to the model

77

Page 96: Neural network-based dynamical modelling for system

in [181], the simplification might affect the non-linear modelling capability of the

neural network. Thus, the simplification is just applied for the non-linear feedback

weights evaluation, and use the model in (3.9) for the non-linear dynamical system

modelling. The model optimization process is formulated in Section 3.4.

wT

1 uτ + b1 . . . wTNuτ + bN yτ . . . yτ−l+1 yτ . . . yτ−l+1

wT1 uτ+1 + b1 . . . wT

Nuτ+1 + bN yτ+1 . . . yτ−l+2 yτ+1 . . . yτ−l+2

.... . .

......

. . ....

. . ....

wT1 uτ+N−1 + b1 . . . wT

Nuτ+N−1 + bN yτ+N−1 . . . yτ+N−l−1 yτ+N−1 . . . yτ+N−l−1

(3.22)

3.4 Randomization-based Training for The NARX-

DNN

To achieve an efficient optimization process and a robust non-linear system mod-

elling, a new optimization algorithm is developed for the proposed NARX-DNN.

Similar to the series-parallel model in [21], static training is applied to avoid

the gradient vanishing. In addition, the non-linear optimization process of the

proposed DNN is resolved into two quadratic optimization problems.

3.4.1 Non-linear Feedback Weights Estimation

For training of the simplified model in (3.21), an objective function can be de-

signed based on the distance between the model outputs and desired training

78

Page 97: Neural network-based dynamical modelling for system

targets,

min {γ12‖εf‖2+

d12‖C‖22}

s.t. εf = Y d −CT H (3.23)

where Y d = [yτ , yτ+1, ..., yτ+N−1] is the desired output training matrix, εf =

[εf1 , εf2 , ..., εfN ] is the error matrix between the desired output matrix and the

output of the simplified DNN, γ1 and d1 are the positive real regularization pa-

rameters, ‖C‖2 is the regularization term, which can also be considered as a

model complexity-penalty. The l2 norm regularization aims at smoothing the

error surface, and stabilizing the solution [27, 33].

For the minimization problem in (3.23), the Lagrange multiplier can be con-

structed as follow:

L1(εf ,C,λ) =γ12

N∑i=1

m∑k=1

ε2fik +d12

N+2n2∑j=1

m∑k=1

c2jk

−N∑i=1

m∑k=1

λki

(y(t+i−1)k −

N+2n2∑j=1

Hjicjk − εfik)

(3.24)

where, εfik is the ikth element of the error matrix εf , y(t+i−1)k is the ikth desired

output, Hji and cjk are the ijth element of hybrid feature matrix H , and jkth

hybrid weights vector C respectively, and λik is the ikth Lagrange multiplier

factor.

To solve the optimization problem in (3.23), the following equations can be

obtained based on the Kuhn-Tucker conditions, and the problem can be solved

by

79

Page 98: Neural network-based dynamical modelling for system

∂∂εfikL1 = γ1εfik + λik = 0

∂∂cjkL1 = d1cjk +

∑Ni=1 λikHji = 0

∂∂λikL1 = y(t+i−1)k −

∑N+2n2

j=1 Hjicjk − εfik = 0

(3.25)

With N training patterns, and batch learning technique, the following equa-

tions can be obtained. λ = −γ1εf

d1CT = −λHT

εf = Y d −CTH

(3.26)

Then, it can be solved as,

CT

= Y dHT

(d1/γ1I + HHT

)−1 (3.27)

where, I is a (N + 2n2)× (N + 2n2) identity matrix.

Since the input weights and biases are randomly and uniformly assigned, the

hybrid feature matrix H can be derived as (3.22). The hybrid weight matrix

CT

is estimated based on (3.27). The output weights, autoregressive feedback

weights, and hybrid feedback weights can then be obtained.

[βTβT

b αT ] = C

T(3.28)

According to (3.18), the nonlinear feedback weight matrix can be estimated as

wb = (βT

b )†αT (3.29)

where (βT

b )†is the Moore-Penrose generalized inverse of the output weights matrix.

The optimization process (3.24) to (3.29) is carried out based on the simplified

80

Page 99: Neural network-based dynamical modelling for system

model shown in (3.21), where the non-linear activation function in the NARX-

DNN is replaced by a linear operation. Then, the non-linear feedback weight wb

can be evaluated by (3.29). To tackle the issues discussed in Remark 3.3.2, the

output weights and autoregressive weights optimizations are carried out based on

estimated wb.

3.4.2 Output Weights and Autoregressive Weights Opti-

mization

With the estimated non-linear feedback weight matrix wb, randomly selected

input weight matrix w and bias vector b, the hidden output of the NARX-DNN

at time t is computed as

ht = h(wTut + wTb ybt + b) (3.30)

Based on the training dataset {X,Y d,Y bt}, we can obtain

Y = βTc H (3.31)

where, H = [hτ , hτ+1, ..., hτ+N−1] is the batched hidden state matrix, βc is

the combined output feedback weights matrix, Y is the output matrix of the

NARX-DNN. Similar to (3.23), the new optimization problem is stated as

min {γ22‖ε‖2+d2

2‖βc‖22}

s.t. ε = Y d − βTc H (3.32)

81

Page 100: Neural network-based dynamical modelling for system

where, γ2 and d2 are positive real regularization parameters, ‖βc‖22 is the l2 norm

regularizer, and its corresponding Lagrange function becomes

L2(ε,βc, λ) =γ22

N∑i=1

m∑k=1

ε2ik +d22

N+n2∑j=1

m∑k=1

β2cjk

−N∑i=1

m∑k=1

λki

(y(t+i−1)k −

N+n2∑j=1

Hjiβcjk − εik)

(3.33)

where, εik is the ikth element of the error matrix ε defined in (3.32), y(t+i−1)k

is the ikth desired output, Hij and βcjk are the ijth element of batched hidden

output matrix H , and jkth element of output feedback combined weights matrix

βc respectively, and λik is the ikth Lagrange multiplier factor.

Similar to (3.25) ∼ (3.27), using Kuhn-Tucker conditions, the following solu-

tion can be obtained.

βT

c = Y dHT

(d2/γ2I + HHT

)−1 (3.34)

Then, the optimal output weights matrix β, and autoregressive feedback weights

matrix βb, can be obtained as

[βTβT

b ] = βT

c (3.35)

Remark 3.4.1. The pseudo-code of the proposed NARX-DNN training algorithm

is presented in Table 3.1. To achieve an efficient optimization process, the non-

linear optimization problem has been simplified into two quadratic optimization

problems, and the recursively training process is avoided by applying the batch

learning technique. By randomly selecting the input weights within a small range

[-0.1,0.1] as (3.11), the hidden nodes will mostly operate within a nearly-linear

regime when the non-linear feedback weights are evaluated. Furthermore, the im-

pact of the non-linear feedback loops on the current output is tuned via the opti-

82

Page 101: Neural network-based dynamical modelling for system

Table 3.1: The pseudo-code of the proposed NARX-DNN

Algorithm 1: NARX-DNN, the proposed algorithm with MonteCarlo approximation based batch training. The default parametersettings for the tested turbine prognostic problem are σ = 0.1,d1/γ1 = 0.001, and d2/γ2 = 0.001.

I. Nonlinear feedback weights evaluation1. Randomly and uniformly select the input weight matrix w

and bias vector b within the range of [−σ, σ].

2. Obtain the hybrid feature matrix H in (3.22) based on thesimplified model.

3. Compute the optimal hybrid weights matrix as (3.27).

4. Derive the output weight matrix βT

, and hybrid output feed-

back weight matrix αT from the hybrid weights matrix CT

as(3.28).

5. Evaluate the nonlinear feedback weight as (3.29).II. Output Weights and Autoregressive Weights Optimization

6. Compute the batched hidden output matrix H based on (3.30).7. Obtain the optimal output feedback combined weights matrix

βc from (3.34).

8. Derive the optimal output weight β, and autoregressive feed-back weight βb from (3.35).

mization of the output weights, which reduces the error in the evaluation process

(3.27)∼(3.29), and enhances the non-linear modelling capability of the proposed

NARX-DNN.

3.5 Model Validations and Comparisons

In order to validate the performance of the proposed model and compare it with

some RNNs, a turbofan engine system prognostic problem is studied in detail

as an example of the dynamical systems discussed in Section II.A. The turbine

engine datasets were collected from NASA in [158]. In this section, the health

index (HI) sequences of different engine units are evaluated by an ELM model as

training samples for the simulations. After that, the NARX-DNN is employed for

83

Page 102: Neural network-based dynamical modelling for system

Figure 3.2: Structure of a simplified turbofan engine system

engine system HIs predictions with comparisons to the ELM, and other RNNs.

All the simulations are conducted in MATLAB 9.2 environment, using a computer

with an Intel Core i5-4590, 3.30 GHz CPU.

3.5.1 Turbofan System Health Index Evaluation

The turbofan engine system simulation used in this chapter was conducted by

NASA in the C-MAPSS environment [158]. In the simulated engine system, many

operating environmental factors, e.g. altitude, temperature, and speed are taken

into consideration by an atmospheric model [157]. The modelling of turbofan

engine deterioration is mainly focused on five rotating components: low-pressure

turbine, high-pressure turbine, high-pressure compressor, low-pressure compres-

sor, and fan, as shown in Fig. 3.2. Various sensor measurements were recorded

from different locations of engine units under different simulated operating con-

ditions, which include pressure and temperature at different axial positions of the

engine, fan speed, and ratio of fuel flow.

The dataset FD001 from [157] is used in the simulations. Data selection

and normalization were first implemented. Similar to the other related literature

[183, 184], for improving the efficiency of training, some chaotic sensory data is

84

Page 103: Neural network-based dynamical modelling for system

(a)

(b)

Figure 3.3: Right aligned run-to-failure sensor measurements of 100 engine unitsfrom FD001: (a) Sensor 2; (b) Sensor 12.

eliminated by observations. Similar to [181], Several sensory sequences, indexed

by 2, 3, 4, 7, 11, 12, 15, 20 and 21, were selected as the inputs of the models. The

visualizations of sensor 2 and 12 run-to-failure measurements from 100 simulated

turbine engine units are shown in Fig. 3.3. All the sensory sequences are then

rescaled within [0, 1].

The complexity and uncertainty of turbine degradation process make it almost

impossible to accurately evaluate system HI. In this chapter, a single output ELM

is used to estimate the health state of turbofan engines. The training inputs for

85

Page 104: Neural network-based dynamical modelling for system

Figure 3.4: Visualization of the degradations of different engine units generatedby ELM.

HI sequences generation are collected from the beginning and final sets of all the

engine degradation units, and the corresponding outputs are labelled as 1 and

0, respectively. After training, all the HI can be generated by using the whole

organized sensory sequences as the inputs. The network used for HI generation

is formulated as follow,

yt =M∑j=1

βjhj(wTj ut + bj) (3.36)

where yt is the output value at time t, ut is the corresponding input sensory

vector, wTj and bj represent the input weight vector and bias, which are randomly

uniformly constrained, hj is the jth hidden node output, M is the hidden nodes

number of the ELM model, which is selected as 2000 in the simulations, and βj

is the output weight.

Remark 3.5.1. The initial and final sets of the turbine engine degradation units

are used as the training dataset to generate the HI for the whole degradation

86

Page 105: Neural network-based dynamical modelling for system

process, thus, the intermediate states calculated by the ELM are full of noise and

might be not accurate. However, the generated values are calculated from the

input sequences and represent the summarized system condition state. As shown

in Fig. 3.4, the generated HI sequence clearly illustrates the non-linear time-

related engine degradation processes. Thus, this generated dataset is used in the

following simulations.

3.5.2 System Prognostic and HI Prediction

Different to the HI generation, the output targets of the engine system prognostic

problem become the future engine HI sequence. The generic form of this sequence

output problem is formulated as,

yt+1, ..., yt+m = N (yt, ..., yt−l+1;ut, ...,ut−d+1) (3.37)

As discussed in Remark 3.2.1, a dynamical system modelling problem with

sequence output is investigated. In the simulations of this chapter, the output

nodes number m is selected as 10. Then, one output vector can be considered as

one system output dynamical state indicated by 10 data points. Based on that,

the input delay number d, and feedback delay number n2 are both selected as 20,

which can be considered as two system states. It is noted that the input sequence

length and feedback length should be designed according to the complexity of the

system mechanism to provide sufficient information.

The 100 run-to-failure turbofan engines dataset is divided into two parts: the

training dataset with 80 turbofan units, and the testing datasets with the left 20

turbofan units. After training on the same training dataset, multiple repeated

HI sequence prediction trials are carried out in different degradation stages of the

testing engine units, by using the ELM, the DNN with output self-feedback, and

the proposed NARX-DNN.

87

Page 106: Neural network-based dynamical modelling for system

(a)

(b)

(c)

Figure 3.5: Engine early stage HI sequence predictions (repeated 20 trials) on atesting engine unit by (a) the ELM; (b) the DNN with output self-feedback; (b)the NARX-DNN.

88

Page 107: Neural network-based dynamical modelling for system

(a)

(b)

(c)

Figure 3.6: Engine intermediate stage HI sequence predictions (repeated 20 trials)on a testing engine unit by (a) the ELM; (b) the DNN with output self-feedback;(b) the NARX-DNN.

89

Page 108: Neural network-based dynamical modelling for system

(a)

(b)

(c)

Figure 3.7: Engine failure stage HI sequence predictions (repeated 20 trials) on atesting engine unit by (a) the ELM; (b) the DNN with output self-feedback; (b)the NARX-DNN.

90

Page 109: Neural network-based dynamical modelling for system

It is noted that the development of the engine units is almost flat at the

beginning stage, with some fluctuations purely due to noise. As shown in Fig.

3.5, the ELM is very sensitive to the noise. In contrast, with the help of output

feedback loops, the prediction results of the DNN with output self-feedback and

NARX-DNN show much smaller prediction variance. In Fig. 3.6, the propagation

trend becomes much clearer at the engine degradation intermediate stage. All

these three methods perform well. However, as we can see from Fig.3.7, the

proposed NARX-DNN shows a much better performance at the failure stage of

the degradation process, compared to both the ELM and the DNN with output

self-feedback, which indicates that the nonlinear system modelling capability is

significantly enhanced.

In Table 3.2, 3.3, and 3.4, the averaged root mean square error (RMSE) and

standard deviation (SD) of 100 repeated predictions by three different models,

the ELM, the DNN with output self-feedback and the proposed NARX-DNN, are

recorded under different hidden nodes settings. RMSE1, RMSE2, and RMSE3

indicate the averaged RMSEs obtained from initial, intermediate, and terminal

testing datasets, correspondingly. It is noted that the proposed NARX-DNN with

200 hidden nodes has achieved smaller MSE and SD than the ELM with 800 hid-

den nodes, which suggests that the modelling efficiency is significantly improved

by adding 20 feedback nodes. Especially, in the terminal case, the NARX-DNN

receives the much smaller RMSE and SD, compared to the ELM and DNN with

output self-feedback, which indicates the better prediction accuracy and model

robustness.

Further investigations are carried out by comparing the proposed NARX-DNN

to the NARX network. The NARX network is trained by the Adam [83], based

on the series-parallel model in [21].

As shown in Table 3.5, the NARX network achieves much smaller RMSE

91

Page 110: Neural network-based dynamical modelling for system

Table 3.2: Testing results of the ELM from 100 repeated prediction trials withdifferent hidden nodes number at different system stages

Hidden NodeNumber

ELM(RMSE e−2, SD e−6)RMSE1 SD1 RMSE2 SD2 RMSE3 SD3

120 7.571 42.30 7.716 19.98 9.420 17.34200 5.235 11.65 5.645 4.918 7.838 8.151280 4.112 1.091 4.946 0.234 6.812 5.698320 3.888 0.960 4.848 0.238 6.564 4.605400 3.624 0.508 4.733 0.113 6.167 2.570480 3.470 0.347 4.657 0.106 5.950 2.165560 3.362 0.270 4.589 0.083 5.770 1.306640 3.266 0.304 4.534 0.094 5.638 1.163720 3.181 0.292 4.482 0.083 5.525 0.792800 3.124 0.229 4.441 0.071 5.425 0.828

Table 3.3: Testing results of the DNN with output self-feedback from 100 repeatedprediction trials with different hidden nodes number at different system stages

Hidden NodeNumber

DNN1(RMSE e−2, SD e−6)RMSE1 SD1 RMSE2 SD2 RMSE3 SD3

120 2.969 0.052 4.910 0.191 5.619 0.167200 2.871 0.010 4.439 0.013 5.197 0.025280 2.713 0.045 4.369 0.012 5.017 0.069320 2.665 0.049 4.349 0.010 4.953 0.086400 2.589 0.044 4.316 0.015 4.854 0.092480 2.537 0.036 4.286 0.019 4.777 0.085560 2.488 0.031 4.260 0.018 4.707 0.087640 2.453 0.030 4.233 0.020 4.645 0.077720 2.420 0.026 4.209 0.018 4.599 0.071800 2.392 0.024 4.185 0.018 4.551 0.065

compared to the ELM when the hidden nodes number is small. However, when

the hidden nodes number becomes larger, the NARX network suffers from the

slow-convergence problem. The performance of the NARX is not improved by

adding hidden nodes. In contrast, the performance of the ELM and the proposed

NARX-DNN becomes much better.

It should be highlighted that the training of the traditional NARX networks

costs much longer time, compared to both ELM and the proposed NARX-DNN.

Since no recursive steps are involved in the training of the proposed NARX-

92

Page 111: Neural network-based dynamical modelling for system

Table 3.4: Testing results of the proposed NARX-DNN from 100 repeated pre-diction trials with different hidden nodes number at different system stages

Hidden NodeNumber

NARX-DNN(RMSE e−2, SD e−6)RMSE1 SD1 RMSE2 SD2 RMSE3 SD3

120 2.969 0.058 4.910 0.217 5.616 0.200200 2.866 0.019 4.433 0.010 5.172 0.037280 2.688 0.039 4.350 0.015 4.822 0.099320 2.634 0.037 4.328 0.012 4.722 0.078400 2.556 0.038 4.287 0.016 4.597 0.075480 2.492 0.034 4.246 0.019 4.498 0.087560 2.441 0.051 4.212 0.029 4.420 0.065640 2.396 0.044 4.179 0.022 4.353 0.060720 2.349 0.038 4.143 0.025 4.286 0.065800 2.316 0.034 4.115 0.026 4.240 0.061

Table 3.5: Comparisons on averaged testing RMSE and traing time cost

Hidden NodeNumber

ELM NARX NARX-DNNRMSE Time/s RMSE Time/s RMSE Time/s

20 0.118 0.007 0.0450 27.29 0.052 0.05950 0.091 0.024 0.0439 43.61 0.051 0.105100 0.074 0.031 0.0443 110.51 0.049 0.119300 0.053 0.088 0.0446 189.89 0.042 0.251600 0.046 0.213 0.0449 241.50 0.041 0.464800 0.045 0.853 0.0453 293.40 0.040 1.880

DNN, it takes less than one second to obtain the optimal solution. In contrast,

the NARX network requires hundreds of times longer for training. In many real-

time applications, the low training efficiency of the model could be a big issue for

practical usage.

In Fig. 3.8, the engine dataset is divided into 10 subsets. 10-fold cross-

validations are performed by the ELM with 800 hidden nodes, NARX-BP with

50 hidden nodes and NARX-DNN with 300 hidden nodes. For each subset, 20

repeated prediction trials are carried out, with the rest 9 subsets as the training

data. The results show that the proposed NARX-DNN model achieves much

better performance.

93

Page 112: Neural network-based dynamical modelling for system

Figure 3.8: Testing performance of 10-fold cross validation between ELM with 800hidden nodes, NARX-BP (series-parallel model) with 50 hidden nodes (trainedby Adam), and NARX-DNN with 300 hidden nodes. The solid lines connect theaveraged RMSE among 20 repeated simulations in the 10-fold cross validations.

Remark 3.5.2. The proposed NARX-DNN achieves excellent performance in

terms of both efficiency and robustness for the turbofan engine system modelling

problem. With the Monte Carlo based stochastic training and regularized batch

learning type of least squares method, the training process of the NARX-DNN has

achieved very fast speed and high accuracy. Compared to the ELM, the neural

network’s capability of capturing non-linear dynamic features is significantly im-

proved by the feedback loops. It suggests that the feedback loops may help the model

recover the information missed during the random projections of the inputs. Com-

pared to the BP trained NARX network, the proposed model has achieved much

faster training-speed and better prediction performance.

3.6 Conclusions

A NARX structure based dynamic modelling approach has been developed in this

chapter for generic dynamical systems and applied for the turbine engine system

prognostics. The proposed NARX-DNN has been designed so that the output of

94

Page 113: Neural network-based dynamical modelling for system

the system is not only driven by the current system inputs but also dynamically

affected by its own previous states with different feedback propagating mecha-

nisms. Moreover, a fast training strategy has been developed for the proposed

model based on the stochastic training algorithm. The proposed model’s utility

has been demonstrated by using the NASA turbofan engine simulation datasets

where a number of simulated predictions have been carried out. The results have

shown that, with much smaller hidden nodes and faster training process, the

proposed NARX-DNN has achieved much better performance in turbine system

health condition predictions compared to the ELM and the BP trained NARX

networks. It is also noted that the proposed model better suits practical real-time

applications, where high training efficiency is required. Further work about the

stochastic training methods for sequence data and optimization techniques for

dynamical system modelling are under investigations.

95

Page 114: Neural network-based dynamical modelling for system

96

Page 115: Neural network-based dynamical modelling for system

Chapter 4

Learning from Back-forward

Stochastic Convolution for

Dynamical System Modelling

with Sequence Output

Well-explained knowledge is necessary for a good supervised learning procedure.

In traditional neural network-learning, the hidden layers of the network are all

generated from the input data by feature extraction. In this chapter, we look into

the feature extraction from both the output and input sides of neural networks,

where the output feature extraction can be considered as a knowledge explaining

procedure. For the dynamical system modelling problem with sequence input and

output, a back-forward stochastic convolution (BFSC) based learning scheme is

proposed. Then, the BFSC based extreme learning machine (BFSC-ELM) and

BFSC based echo state networks (BFSC-ESNs) are developed and applied to

Mackey-Glass system prediction, and turbofan engine degradation process mod-

elling using NASA’s simulated engine performance data. The application results

show that, compared to the original ESNs and ELM, the BFSC trained network

97

Page 116: Neural network-based dynamical modelling for system

achieves much better performance for these two cases of application.

4.1 Introduction

Time-sequence can be considered as a series of observations and measurements,

which presents dynamical information of system evolution. To derive the system

developing trend from system measurements, dynamical system identification and

control by using signal data analytics and neural networks (NNs) has become a

very popular area in the past decade [21, 27, 53]. The main challenges of such

system modelling and prediction problem include time-varying system nonlinear-

ity, chaotic time-behaviours, and high-level noise. Especially, the multi-variate

dynamic time series prediction problem requires a model which is able to pro-

cess multi-variate input sequences and effectively capture the dynamical features

[171, 185]. It is noted that a single sampling data point with noise cannot suf-

ficiently represent system condition. Instead, time sequence consists of multiple

time-steps is normally used to represent the dynamical state. Similar to the time

series prediction problem in [186], which deals with predictions for a set of times-

pan system states, in this chapter, we focus on the dynamical system modelling

with sequence output.

The RNNs has been widely used to process time sequence data and detect

time-related behaviours [179]. Since 1980s, many RNNs have been developed,

e.g., Elman networks [41], Jordan networks [40], and nonlinear autoregressive

exogenous inputs (NARX) network [187]. Many successful real-world applications

are achieved by using RNNs [11, 45, 188]. However, to derive the dynamical time-

related features, back-propagation through time (BPTT) is normally used to train

the model. The time-consuming training problem has been recognized as a big

issue in some real-time applications [189].

98

Page 117: Neural network-based dynamical modelling for system

In contrast, randomization based training technique has been developed for

the feedforward neural networks, e.g., Random vector functional-link (RVFL)

network [9], extreme learning machine (ELM) [10, 91], and stochastic configura-

tion networks [178]. The neural connections between input and hidden nodes are

randomly assigned, only the output weights are trained. Such learning scheme is

able to achieve very fast learning speed and exhibits very good performance in

various applications [87]. Furthermore, it has been theoretically proven that the

stochastic training process does not degrade the generalization capability of the

neural network [190].

Randomization-based training technique has also been developed for the re-

current networks, which is called reservoir computing (RC) [49]. The key part

of the RC architecture is the large reservoir which refers to the hidden mem-

ory of the RNNs. Echo state networks (ESNs), as one of the major framework

of the RC models, has been developed since 2001 [50]. Similar to the training

of ELM and RVFL network, the internal states of the ESNs are randomly gen-

erated, only the readout is trained. The ESNs shows excellent performance in

chaotic time series predictions [51, 52, 56], and non-linear system identification

[53]. It is reported that the ESNs is able to obtain better performance, compared

to traditional RNNs and long-short term memory networks, in some real-world

short-term prediction cases [55]. Benefit from the randomization process of the

input and internal weights, the non-linear training problem of the ESNs is trans-

ferred to a linear optimization problem. The training process thus becomes much

faster than BPTT based training. Without the iterative gradient descent, the

local minima and vanishing gradient problem are avoided.

In this chapter, we focus on the dynamical system modelling problem with

sequence output. Each short time sequence is considered as one dynamical sys-

tem state. Inspired by the human teaching process: good supervisors normally

explain knowledge well to help the student learn and understand better. In su-

99

Page 118: Neural network-based dynamical modelling for system

pervised neural network training, well-designed/processed labels will also benefit

the learning procedure. In light of this, a back-forward stochastic convolution

(BFSC) based training approach is proposed for dynamical system modelling

with sequence output. Two new models are developed based on the ELM and

ESNs by applying the BFSC.

The rest of the chapter is organized as follows. In Section 4.2, brief descriptions

are given for the problem of nonlinear dynamical modelling with sequence output,

the stochastic convolution sampling based approximation, as well as the ELM and

ESNs models. In Section 4.3, the BFSC training scheme is developed and applied

to the ELM and ESNs. Section 4.4 presents the performance investigations of the

developed BFSC-ELM and BFSC-ESNs, with comparisons to the ELM and ESNs.

These models are applied for two dynamical system modelling cases: chaotic

time series problem and NASA turbofan engine system prognostics. Section 4.5

concludes the chapter.

4.2 Problem Formulation

In this section, the dynamical neural modelling problem with sequence output is

firstly introduced. Then the stochastic convolution sampling based approximation

and the ESNs model are discussed.

4.2.1 Dynamical Neural Modelling with Sequence Output

The NNs based dynamical system modelling problem is comprehensively intro-

duced in [21]. Dynamical system states are represented or determined by the

system current exogenous inputs and output feedback. To approximate system

nonlinear developing processes, many NNs based models with feedback structures

are designed to capture the system dynamical features. However, it is noted that

100

Page 119: Neural network-based dynamical modelling for system

(a)

(b)

Figure 4.1: Examples of nonlinear dynamical system modelling with sequenceoutput. (a) sequence predictions of the Mackey-Glass system; (b) data visualiza-tion of a turbofan engine dynamical modelling problem: multi-variate sequenceinput and sequence output.

a single sampling data point with noise cannot sufficiently represent the system

condition. Instead, time sequence consists of multiple time-steps is normally used

to represent the dynamical state.

In this chapter, we focus on the dynamical system modelling problem with se-

quence output, which is more practical useful compared to the modelling problem

101

Page 120: Neural network-based dynamical modelling for system

with single output. We generalize the problem as,

yt+1, ..., yt+m = N1(yt, ..., yt−l+1;ut, ...,ut−n+1) (4.1)

where, [ut, yt] is one system input-output pair, N1 : Rl+n → Rm is a neural net-

work function, representing the nonlinear dynamic plant. Especially, the output

of the model in (4.1) is assigned as a sequence, which can be considered as one

system dynamical state. An example of such dynamical modelling is shown in

Fig. 4.1b, which is a turbofan engine system prognostic problem. Typically, there

are two special cases of (4.1), formulated as,

yt+1, ..., yt+m = N1(ut, ...,ut−n+1) (4.2)

yt+1, ..., yt+m = N1(yt, ..., yt−l+1) (4.3)

where, (4.2) is a dynamical system modelling problem without output feedback,

(4.3) is a sequence to sequence time-series prediction problem. An example of

(4.3) is presented in Fig. 4.1a. It is noted that the system input and output feed-

back are both important for system modelling. Especially, the feedback sequence

might significantly improve the robustness and accuracy of the modelling.

Remark 4.2.1. It is a practical problem to choose the length of the output m,

input and feedback lag values, n and l. As mentioned, the output sequence with

length m is used to represent the system dynamical state and short-term trend.

The input and feedback lag values are designed to provide sufficient information

for the system modelling, normally, m ≤ n, l. Similar to the autoregressive (AR)

models, partial autocorrelation plot can be used to identify the order. In this

research, all the used models are organized with the same input and output length,

e.g., n, l = 2m, to validate our developed training scheme.

102

Page 121: Neural network-based dynamical modelling for system

4.2.2 Approximation with Stochastic Convolution Sam-

pling

An integral ε-approximation of N (u) with multiple input can be formulated as

|N (u)−∫u∈U

f(u)du|≤ ε (4.4)

where, f : Rn → R is a nonlinear function, and u ∈ Rn is an input vector.

For the sequence input case, we can have a convolution sampling of the inte-

gral as hi(wi) = [fi(wiu1), ..., fi(wiuN)]T , where wi is ith convolution kernel, fi

represents the nonlinear dependence.

In neural network form, the function f can be represented by a general nonlin-

ear activation function g : R → R, and then fi can be formulated by g(bi), where

bi is a bias parameter. The nonlinear dependencies between different samplings

and target outputs can be approximated by appropriately selecting the nonlin-

ear function g, and bias bi. Then, one of the convolution results of the integral

sampling can be formulated as

hi = [g(wiu1 + bi), ..., g(wiuN + bi)]T (4.5)

Similar to Theorem 2.1 and 2.2 in [10], we can have

Theorem 4.2.1. Given an infinitely differentiable activation function g : R →

R, distinct training samples (U ,Y ), where U = [u1, ...,uN ], Y = [y1, ..., yN ],

ui ∈ Rn, and yi ∈ Rm. Based on (4.5), the feature matrix H = [h1, ..., hN ]T can

be generated by randomly selecting sampling kernels w = [w1, ..., wN ], wi ∈ Rn,

and bias vector b = [b1, ..., bN ] within certain interval and distribution, where N ≤

N ,wi ∈ Rn and bi ∈ R. The feature matrix H is invertible, and ∃ β ∈ Rm×N , for

∀ ε ∈ R+, we can have ‖βH − Y ‖< ε.

103

Page 122: Neural network-based dynamical modelling for system

The proof of Theorem 4.2.1, when N = N , can be found in [10, 61]. Then,

an ε-approximation of N(u) can be formulated as

N (u) ≈N∑i=1

βig(wiu+ bi) (4.6)

Remark 4.2.2. Stochastic convolution sampling or randomization based approaches

have been successfully applied to train the neural networks [88], e.g., RVFL net-

work, ELM, and ESNs. The input weights of all these models are randomly

constrained and fixed within a certain range. As discussed in [89, 176], the ap-

proximation error will converge to zero when N → ∞. Such training scheme

can achieve very good performance when the hidden nodes number of the model

is sufficiently large.

4.2.3 The ESNs

In addition to the feedforward network in 4.6, the stochastic training based tech-

nique has also been developed for the RNNs training. As shown in Fig. 4.2, we

formulate the leaky integrator ESNs with output feedback in form of prediction

problem [59],

xt = winut +whht−1 +wfyt (4.7)

ht = (1− γ)ht−1 + γg(xt) (4.8)

yt+1 = f(βzt) (4.9)

where, g : R → R, and f : R → R are nonlinear differentiable functions, ut ∈ Rn

is the input vector, ht and ht−1 ∈ RN are the tth and (t − 1)th hidden state

vectors, yt ∈ Rl is the output feedback, yt+1 ∈ Rm is the output vector. The

104

Page 123: Neural network-based dynamical modelling for system

Figure 4.2: An ESNs with multi-variate sequences input, sequence feedback andsequence output.

hidden state is determined by the system input, hidden state feedback and output

feedback, connected correspondingly with the input weights matrix win ∈ RN×n,

reservoir internal weights matrix wh ∈ RN×N , and output feedback matrix wf ∈

RN×l, which are all randomly selected. For the sequence input and feedback, all

those matrix can be considered as convolution sampling kernels. b ∈ RN is the

bias vector, β ∈ Rm×N is the output weight matrix, γ is called leaking parameter.

The condition of reservoir state asymptotical convergence is named as echo-

state property (ESP). The ESP of the networks is mainly determined by the

algebraic properties of the internal weight matrix and system input. Generally,

to make the network run under the ESP condition, the internal weights matrix is

randomly sparsely distributed, with spectral radius ρ(wh) < 1.

Remark 4.2.3. For sequence input, the internal states of the ESNs can be con-

sidered as the stochastic convolution sampling results of the system input, internal

105

Page 124: Neural network-based dynamical modelling for system

feedback, and output feedback. The sampling filters are randomly selected under

the ESP. The internal weight matrix can be generated as follow,

wh = ws ·(δ/max(λws)

)(4.10)

where, ws ∈ RN×N is a sparsely uniformly distributed random matrix, λws is

the eigenvector of ws, and δ is a positive scalar which is normally selected as

smaller than one, to constrain the spectral radius of the internal weight matrix

to be less than unity. However, it is noted that ρ(wh) < 1 is considered as a

rather restrictive condition for the ESP [50]. Further discussion of the ESP can

be found in [57, 58].

4.3 The BFSC-based Learning Scheme

In this section, the BFSC based approaches are developed for the dynamical

system modelling with sequence input and output. The back-forward stochastic

convolution process is firstly presented based on the ELM model. Some theoretical

discussions and analysis are carried out. Then, the BFSC is further applied for

the ESNs. Different from the original ELM and ESNs, the developed models

are designed with output backwards feature extraction as a knowledge explaining

process to enhance the model learning capability.

4.3.1 BFSC-based Neural Modelling

Assume that we have N distinct training samples (U ,Y ), where U = [u1, ...,uN ],

and Y = [y1, ...,yN ], ut ∈ Rn and yt ∈ Rm represent the input sequences, and

output sequence at time t, respectively. Then, we can randomly generate the

106

Page 125: Neural network-based dynamical modelling for system

input and output convolution results by BFSC as

xt = [w1ut,w2ut, ...,wN1ut]T (4.11)

and

st = [α1yt,α2yt, ...,αN2yt]T (4.12)

where, wi ∈ Rn(i = 1, 2, ..., N1) is ith input convolution kernel, N1 is the input

convolution kernel number, αj ∈ Rm(j = 1, 2, ..., N2) is jth output convolution

vector, N2 is the output convolution number. According to Theorem 4.2.1, all

those convolution kernels can be randomly selected in a certain range and distri-

bution to generate the feature representations of the input and output.

Similar to the approximation problem in (4.4), we would perform a nonlinear

approximation of the unknown system input-output mechanism. Similar to the

ELM model, we can obtain the hidden state matrix as H = [h1,h2, ...,hN ], where

ht = [g(w1ut + b1), g(w2ut + b2), ..., g(wN1ut + bN1)]T is the tth hidden feature

vector calculated from the tth input sample.

On the other hand, the output sequence can be processed as,

zt = F(

(st + v)/a0

)(4.13)

where, zt is the output feature vector of kth output sequence, F : R → R

is the inverse function of an activation function, e.g., F(x) = −ln(1/x − 1),

x ∈ (0, 1), if the activation function is chosen as f(x) = 1/(1 + e−x) . Bias

vector v = [v1, v2, ..., vN2 ] and scalar factor a0 are chosen to constrain the output

feature matrix S = [s1, s2, ..., sN ] within the definition domain of function F . For

example, to scale all the elements of the matrix within (0, 1), we can calculate v0

107

Page 126: Neural network-based dynamical modelling for system

as,

v0 = −min(S + σ) (4.14)

where, σ is a small positive number. Bias vector v can be randomly generated

within [v0, v0+ε], ε is a small positive number, and S = [s1+v, s2+v, ..., sN +v].

Then we obtain the scalar factor a0 as,

a0 = max(S + σ) (4.15)

Given the output samples Y, we can obtain the output feature matrix Z = [z_1, z_2, ..., z_N] based on (4.13). Then, the dynamical relationship between the input and output sequences can be approximated by a neural projection N : R^{N1} → R^{N2} between the input and output feature spaces. The input and output feature matrices then become the training input and target.
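A minimal sketch of the output-side processing in (4.12)-(4.15), assuming NumPy and with all names and parameter values being illustrative, is given below; Y holds the output samples as columns and the sigmoid inverse F(x) = −ln(1/x − 1) is used, as in the text.

```python
import numpy as np

def backward_output_features(Y, N2, sigma=1e-2, eps=1e-3, seed=0):
    """Sketch of the backward stochastic convolution and scaling of the outputs."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1, 1, (N2, Y.shape[0]))   # random output kernels alpha_j
    S = A @ Y                                  # backward convolutions s_t, eq. (4.12)
    v0 = -S.min() + sigma                      # shift so that S + v > 0, eq. (4.14)
    v = rng.uniform(v0, v0 + eps, (N2, 1))
    a0 = (S + v).max() + sigma                 # scale (S + v)/a0 into (0, 1), eq. (4.15)
    Z = -np.log(a0 / (S + v) - 1.0)            # z_t = F((s_t + v)/a0), eq. (4.13)
    return Z, A, v, a0
```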

One simple case of the BFSC-based neural model is shown in Fig. 4.3, which is a BFSC-based ELM (BFSC-ELM) model with multi-variate sequence input and sequence output. It is noted that, for dynamical system modelling, the output feedback sequence y_t is also designed as one of the input sequences in Fig. 4.3.

After the BFSC process shown in Fig. 4.3a, we can obtain the optimal solution of the hidden weights matrix β ∈ R^{N2×N1} and the output weights matrix ω ∈ R^{m×N2}. Then, for the dynamical system modelling problem, we have a four-layer network shown in Fig. 4.3b, which can be formulated as

ẑ_t = β g(w u_t + b)    (4.16)

y_{t+1} = ω [ a_0 · f(ẑ_t) − v ]    (4.17)


Figure 4.3: Sketch of the BFSC training scheme: (a) back-forward stochasticconvolution; (b) BFSC-based four-layer neural network.

where u_t consists of multiple input sequences and one output feedback sequence, and ẑ_t represents the estimation of the output feature vector z_t defined in (4.13). It is noted that the input weights w and the bias vectors b and v are all randomly selected; only the hidden weights matrix β and the output weights matrix ω need to be optimized.

Remark 4.3.1. As discussed in [61], by constructing two-hidden-layer feedforward networks, the learning capability and storage capacity can be dramatically improved. Different from [61], in this chapter we consider BFSC-based learning for the sequence input and output problem. As illustrated in Fig. 4.3, the dynamical features are first extracted from both the input and output sides. Then the learning procedure is carried out between the input and output feature spaces. The BFSC-based learning procedure can be summarized as the following three steps (a minimal sketch of the whole procedure is given after the list):

i. Input and output feature extraction by the BFSC, which is considered as a knowledge explaining process.

ii. Feature learning between the input and output feature spaces by neural networks, which is designed to capture the dynamical relationship of the input-output sequences.

iii. Output recovery based on the learned knowledge.
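The following Python sketch, assuming NumPy, outlines the three steps for a BFSC-ELM on column-stacked samples; ridge-regularized least squares is used for both trainable matrices, and all names and hyper-parameters are illustrative rather than the exact settings used in this thesis.

```python
import numpy as np

def train_bfsc_elm(U, Y, N1, N2, lam=1e-3, sigma=1e-2, eps=1e-3, seed=0):
    """Sketch of steps i-iii (eqs. 4.11-4.17): random features, feature learning, output recovery."""
    rng = np.random.default_rng(seed)
    g = np.tanh
    f = lambda x: 1.0 / (1.0 + np.exp(-x))          # sigmoid; F in (4.13) is its inverse
    # step i: forward and backward stochastic convolutions
    W = rng.uniform(-1, 1, (N1, U.shape[0]))
    b = rng.uniform(-1, 1, (N1, 1))
    A = rng.uniform(-1, 1, (N2, Y.shape[0]))
    H = g(W @ U + b)                                # hidden (input) features, columns h_t
    S = A @ Y                                       # output convolutions s_t, eq. (4.12)
    v0 = -S.min() + sigma
    v = rng.uniform(v0, v0 + eps, (N2, 1))
    a0 = (S + v).max() + sigma
    Z = -np.log(a0 / (S + v) - 1.0)                 # output features z_t, eq. (4.13)
    # step ii: feature learning, beta ~ argmin ||beta H - Z||^2 + lam ||beta||^2
    beta = Z @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(N1))
    # step iii: output recovery, omega maps the convolutions s_t back to y_t (cf. eq. 4.43);
    # at prediction time it is applied to the recovered s_t = a0 * f(beta @ h) - v (eq. 4.27)
    omega = Y @ S.T @ np.linalg.inv(S @ S.T + lam * np.eye(N2))
    return W, b, beta, omega, v, a0
```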

4.3.2 Some Discussions on BFSC

Based on the investigations of the capabilities of four-layered feedforward neural networks in [61] and of the ELM in [91], we theoretically analyze the BFSC training in this section. For the BFSC learning process formulated in (4.13)-(4.17), we have the following result.

Theorem 4.3.1. Assume that we have an infinitely differentiable function F : R → R; a full-rank matrix Y ∈ R^{m×N} which consists of N distinct vectors [y_1, ..., y_N]; a matrix α = [α_1, ..., α_{N2}]^T with α_k ∈ R^m and m < N2 < N, randomly selected within a certain interval and distribution such that α_i y_k ≠ α_i y_{k'} for any vector α_i and all k ≠ k'; a bias vector v = [v_1, ..., v_{N2}] randomly generated within [v_0, v_0 + ε] with v_k ≠ v_{k'} for all k ≠ k', where ε is a small positive real number; and scalars v_0 and a_0 appropriately selected so that all the elements of the matrix S̄ = (αY + v)/a_0 lie within the domain of definition of F. Then we have a full-rank matrix Z = F( (αY + v)/a_0 ), Z ∈ R^{N2×N}, with rank(Z) = N2, and all the N column vectors of Z are distinct.

Z = [ F((α_1 y_1 + v_1)/a_0)          ⋯   F((α_1 y_N + v_1)/a_0)
      ⋮                                ⋱   ⋮
      F((α_{N2} y_1 + v_{N2})/a_0)     ⋯   F((α_{N2} y_N + v_{N2})/a_0) ]    (4.18)

Proof. We can arbitrarily take N2 column vectors from Z, which make up a sub-matrix Z_s ∈ R^{N2×N2} formulated as follows (we take the first N2 column vectors of Z as an example):

Z_s = [ F((α_1 y_1 + v_1)/a_0)          ⋯   F((α_1 y_{N2} + v_1)/a_0)
        ⋮                                ⋱   ⋮
        F((α_{N2} y_1 + v_{N2})/a_0)     ⋯   F((α_{N2} y_{N2} + v_{N2})/a_0) ]    (4.19)

We can take z(v_k) = [ F((α_k y_1 + v_k)/a_0), ..., F((α_k y_{N2} + v_k)/a_0) ] as the kth row of Z_s, which can also be considered as a vector function of v_k. Similar to the proof method in [61], assume that rank(Z_s) < N2, say N2 − 1. Then there exists a vector c = [c_1, c_2, ..., c_{N2}] which is orthogonal to the subspace spanned by the z(v_k), whose dimension is N2 − 1. In other words, for all v_k ∈ [v_0, v_0 + ε], c is orthogonal to the vector z(v_k) − z(v_0). Thus, the assumption can be written as

c · ( z(v_k) − z(v_0) ) = Σ_{i=1}^{N2} c_i · F( (α_k y_i + v_k)/a_0 ) − c · z(v_0) = 0    (4.20)

which can be reformulated as

F( (α_k y_{N2} + v_k)/a_0 ) = −(1/c_{N2}) Σ_{i=1}^{N2−1} c_i · F( (α_k y_i + v_k)/a_0 ) + c · z(v_0)/c_{N2}    (4.21)

Let d_{N2} = α_k y_{N2}/a_0, γ_i = −c_i/c_{N2}, d_i = α_k y_i/a_0, δ_i = d_i − d_{N2}, o = c · z(v_0)/c_{N2}, and v ∈ [d_{N2} + v_0/a_0, d_{N2} + (v_0 + ε)/a_0]. We can then reformulate the equation as

F(v) = Σ_{i=1}^{N2−1} γ_i · F(v + δ_i) + o    (4.22)

Since the nonlinear function F is infinitely differentiable, we can calculate the derivatives of F with respect to v as

F′(v) = Σ_{i=1}^{N2−1} γ_i · F′(v + δ_i)    (4.23)

F^{(2)}(v) = Σ_{i=1}^{N2−1} γ_i · F^{(2)}(v + δ_i)    (4.24)

F^{(l)}(v) = Σ_{i=1}^{N2−1} γ_i · F^{(l)}(v + δ_i)    (4.25)

where F^{(l)}(v) is the lth derivative of the function F with respect to v. When l > N2, the simultaneous equations (4.22)-(4.25) cannot all be satisfied, since there are only N2 free parameters, γ_1, γ_2, ..., γ_{N2−1}, and o. This contradicts the assumption in (4.20), and suggests that the vectors z(v_k) span a subspace whose dimension is larger than N2 − 1. Thus, rank(Z_s) = N2. Since Z_s is a part of the matrix Z ∈ R^{N2×N}, N2 = rank(Z_s) ≤ rank(Z) ≤ N2, and thus rank(Z) = N2. Furthermore, since the column vectors of Z_s were arbitrarily selected from Z, any N2 column vectors of Z form a full-rank matrix Z_s ∈ R^{N2×N2}, which means that the N column vectors of Z are distinct.

In the BFSC training process, the matrix Z is designed as the training target for the first three layers of the network. According to Theorem 4.2.1, by taking U as the input matrix and Z as the output target matrix, both of which contain N distinct samples, we can obtain β ∈ R^{N2×N1} which makes ‖βH − Z‖ < ε, where ε is a small positive number. Based on that, we can have an estimation of the output feature matrix

Ẑ = βH    (4.26)

and the output convolution matrix S = [s_1, ..., s_N] can be recovered by the inverse process of (4.13) as

ŝ_t = a_0 · f(ẑ_t) − v    (4.27)

After that, we can further recover the output matrix Y by a deconvolution process of (4.12). The optimization of the hidden weights and output weights is formulated in the following section.

4.3.3 BFSC-ESNs

To effectively solve the dynamical system modelling problem with sequence input and output, the BFSC-ESNs and its training approach are developed in this section. The BFSC-ESNs can be formulated as

x_t = w_in u_t + w_h h_{t−1} + w_f y_t    (4.28)

h_t = (1 − γ) · h_{t−1} + γ · g(x_t)    (4.29)

y_{t+1} = ω [ a_0 · f(βh_t) − v ]    (4.30)

where h_{t−1} and h_t ∈ R^{N1} are the hidden state vectors at times t − 1 and t, respectively, w_in ∈ R^{N1×n} is the input weights matrix, w_h ∈ R^{N1×N1} is the reservoir internal weights matrix, w_f ∈ R^{N1×l} is the output feedback weights matrix, β ∈ R^{N2×N1} is the feature weights matrix which connects the input feature space and the output feature space, a_0 and v are the scaling parameters defined in (4.14) and (4.15), and ω ∈ R^{m×N2} is the output weights matrix. All the other variables follow the definitions in Section 4.2.
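A minimal sketch of the reservoir state updates (4.28)-(4.29), assuming NumPy, is given below; the weight matrices are assumed to have been generated randomly (e.g., with the reservoir initialization sketch above), and all names are illustrative.

```python
import numpy as np

def esn_states(U, Y_fb, w_in, w_h, w_f, gamma=0.5, g=np.tanh):
    """Sketch of (4.28)-(4.29): leaky-integrated ESN states driven by inputs and output feedback."""
    N1, N = w_h.shape[0], U.shape[1]
    H = np.zeros((N1, N))
    h = np.zeros(N1)
    for t in range(N):
        x = w_in @ U[:, t] + w_h @ h + w_f @ Y_fb[:, t]   # internal state x_t, eq. (4.28)
        h = (1 - gamma) * h + gamma * g(x)                # leaky integration, eq. (4.29)
        H[:, t] = h
    return H
```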

A set of "washout" training data is applied for the model training initialization. With the training samples (U, Y), where U = [u_0, ..., u_{N−1}] and Y = [y_1, ..., y_N], and u_t ∈ R^n and y_{t+1} ∈ R^m denote the tth input and output sequences, the internal state matrix X = [x_1, ..., x_N] can be iteratively generated as in (4.28) by randomly selecting the input weights matrix w_in, the hidden feedback weights matrix w_h, and the output feedback weights matrix w_f. Then, the hidden state matrix H = [h_1, ..., h_N] can be obtained iteratively as in (4.29).

On the other hand, the output convolution matrix can be generated as S = [s_1, ..., s_N], where s_t ∈ R^{N2} is obtained by backward stochastic convolution of the output training samples Y based on (4.12). Then, we can have the processed output feature matrix Z = [z_1, ..., z_N], where z_t ∈ R^{N2} is calculated as in (4.13).

To evaluate the sequential input-output dynamical relationship, the following optimization problem is formulated with Tikhonov regularization:

min { (γ_1/2) ‖ε_z‖² + (d_1/2) ‖β‖² }    (4.31)

s.t.  ε_z = Z − βH    (4.32)

where γ_1 and d_1 are positive real regularization parameters, and β is the feature projection weights matrix formulated as

β = [ β_{11}     β_{12}     ⋯   β_{1,N1}
      β_{21}     β_{22}     ⋯   β_{2,N1}
      ⋮          ⋮          ⋱   ⋮
      β_{N2,1}   β_{N2,2}   ⋯   β_{N2,N1} ]    (4.33)

And the error term ε_z is written as

ε_z = [ ε_{z,11}     ε_{z,12}     ⋯   ε_{z,1N}
        ε_{z,21}     ε_{z,22}     ⋯   ε_{z,2N}
        ⋮            ⋮            ⋱   ⋮
        ε_{z,N2,1}   ε_{z,N2,2}   ⋯   ε_{z,N2,N} ]    (4.34)

where ε_{z,jt} = z_{jt} − Σ_{i=1}^{N1} β_{ji} h_{it} represents the distance between the jth output feature and the feature value evaluated from the ESNs.

To solve the minimization problem in (4.31), we can form the Lagrangian as

L_1(ε_z, β, λ) = (γ_1/2) Σ_{t=1}^{N} Σ_{j=1}^{N2} ε²_{z,jt} + (d_1/2) Σ_{i=1}^{N1} Σ_{j=1}^{N2} β²_{ji} − Σ_{t=1}^{N} Σ_{j=1}^{N2} λ_{jt} ( z_{jt} − Σ_{i=1}^{N1} β_{ji} h_{it} − ε_{z,jt} )    (4.35)

According to the Kuhn-Tucker conditions, the following conditions are necessary for the optimality of (4.31):

∂L_1(ε_z, β, λ)/∂ε_{z,jt} = 0
∂L_1(ε_z, β, λ)/∂β_{ji} = 0
∂L_1(ε_z, β, λ)/∂λ_{jt} = 0    (4.36)

Solving (4.36), we have

γ_1 ε_{z,jt} + λ_{jt} = 0
d_1 β_{ji} + Σ_{t=1}^{N} λ_{jt} h_{it} = 0
z_{jt} − Σ_{i=1}^{N1} β_{ji} h_{it} − ε_{z,jt} = 0    (4.37)

With the training samples (U, Y), a batched formulation of (4.37) can be written as

λ = −γ_1 ε_z
d_1 β = −λ H^T
ε_z = Z − βH    (4.38)

Solving (4.38), we have the following solution for the problem (4.31):

β = Z H^T ( d_1/γ_1 · I + H H^T )^{−1}    (4.39)

where I is an identity matrix. Then, the estimation of the output feature matrix, Ẑ, can be obtained as in (4.26). Based on (4.27), we can have the estimated output convolution matrix Ŝ = [ŝ_1, ..., ŝ_N]. After that, the output sequence can be recovered by solving the following optimization problem:

min { (γ_2/2) ‖ε‖² + (d_2/2) ‖ω‖² }    (4.40)

s.t.  ε = Y − ω S    (4.41)

Similar to (4.31)-(4.39), we can form the corresponding Lagrangian as

L_2(ε, ω, λ) = (γ_2/2) Σ_{t=1}^{N} Σ_{p=1}^{m} ε²_{pt} + (d_2/2) Σ_{p=1}^{m} Σ_{j=1}^{N2} ω²_{pj} − Σ_{t=1}^{N} Σ_{p=1}^{m} λ_{pt} ( y_{pt} − Σ_{j=1}^{N2} ω_{pj} s_{jt} − ε_{pt} )    (4.42)

and the solution becomes

ω = Y S^T ( d_2/γ_2 · I + S S^T )^{−1}    (4.43)


It is noted that the BFSC learning problems (4.31) and (4.40) are both in quadratic form, and can therefore be solved by the two batched steps (4.39) and (4.43); a minimal sketch of these two steps is given below. The slow-convergence and local-minima problems are avoided, and the network training thus becomes much more efficient compared to gradient-descent-based training.
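The sketch below, assuming NumPy and illustrative ratios d_1/γ_1 and d_2/γ_2, computes the two batched solutions (4.39) and (4.43) from the hidden state matrix, the output feature matrix, the output convolution matrix, and the targets.

```python
import numpy as np

def bfsc_readout(H, Z, S, Y, d1_over_g1=1e-3, d2_over_g2=1e-3):
    """Sketch of the two batched solution steps of the BFSC training."""
    N1, N2 = H.shape[0], Z.shape[0]
    beta = Z @ H.T @ np.linalg.inv(d1_over_g1 * np.eye(N1) + H @ H.T)   # eq. (4.39)
    omega = Y @ S.T @ np.linalg.inv(d2_over_g2 * np.eye(N2) + S @ S.T)  # eq. (4.43)
    return beta, omega
```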

4.4 Performance Investigation

In this section, two dynamical modelling and prediction cases are used to validate the developed BFSC-training-based models: Mackey-Glass sequence prediction and the NASA turbofan engine prognostic problem. Four different models are compared in solving these two problems: the ELM, ESNs, BFSC-ELM, and BFSC-ESNs. All the simulations are carried out in the MATLAB 9.2 environment, using a computer with an Intel Core i5-4590, 3.30 GHz CPU.

4.4.1 Mackey-Glass Sequence Prediction

The Mackey-Glass system was first proposed in the 1970s [61] to describe physiological disorders in dynamical respiratory and hematopoietic diseases. It is known for complex dynamics including chaos. In [51], the ESNs model is applied to perform chaotic time series prediction, which achieved significant improvement compared to traditional methods. The Mackey-Glass equation is shown as follows,

dy_t/dt = β y_τ / (1 + y_τ^n) − ξ y_t,    β, ξ, n > 0    (4.44)

where β is the constant production rate of the system, ξ is the constant decay rate of the system, y_τ denotes the value of y at time t − τ, and n is the exponent. In this demonstration, these parameters are selected as ξ = 1, β = 2, n = 9.65, and τ = 2. In this section, we focus on the Mackey-Glass sequence-to-sequence prediction as in (4.3). The input length of all the used models is selected as 200, and the output length is selected as 100. Fig. 4.4b shows the comparisons between the four different models on the Mackey-Glass sequence predictions. The hidden node number of all four models is selected as 400. For the BFSC-ESNs and BFSC-ELM, the backward convolution number is chosen as 200.
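A minimal sketch of generating a Mackey-Glass sequence with the quoted parameter values, assuming NumPy and a simple Euler integration of (4.44) (step size, history length, and initial value are illustrative assumptions), is given below.

```python
import numpy as np

def mackey_glass(T, beta=2.0, xi=1.0, n=9.65, tau=2.0, dt=0.01, y0=0.5):
    """Sketch: Euler integration of dy/dt = beta*y_tau/(1 + y_tau**n) - xi*y, eq. (4.44)."""
    lag = int(round(tau / dt))
    y = np.full(T + lag, y0)                       # constant history as the initial condition
    for t in range(lag, T + lag - 1):
        y_tau = y[t - lag]                         # delayed state y(t - tau)
        y[t + 1] = y[t] + dt * (beta * y_tau / (1.0 + y_tau ** n) - xi * y[t])
    return y[lag:]
```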

Figure 4.4: Visualization of Mackey-Glass sequence predictions: (a) plots of repeated predictions by BFSC-ESNs with network structure 200-400-200-100; (b) RMSE comparisons of the sequence prediction (100 repeated trials) by four stochastic-training-based models on the same training dataset and test sequence.


Table 4.1: Averaged testing results of four different models on Mackey-Glass sequence predictions (100 repeated trials). All values are ×10⁻³.

Hidden Node   ELM             BFSC-ELM        ESNs            BFSC-ESNs
Number        RMSE    STD     RMSE    STD     RMSE    STD     RMSE    STD
50            39.5    5.02    18.1    5.15    46.1    8.51    14.2    4.69
100           22.9    7.40    10.0    3.04    30.8    8.97    8.37    3.15
200           6.35    1.73    4.19    1.38    11.2    3.58    3.86    1.13
300           3.85    0.87    2.69    0.79    4.69    1.62    2.17    0.74
400           2.88    0.72    1.96    0.48    2.83    0.77    1.53    0.43
500           2.45    0.46    1.70    0.43    2.05    0.48    1.05    0.27
600           2.11    0.51    1.51    0.32    1.58    0.38    0.89    0.25
700           1.89    0.37    1.43    0.31    1.33    0.35    0.74    0.20
800           1.75    0.38    1.24    0.25    1.13    0.30    0.64    0.15

The testing root mean square error (RMSE) is calculated for each prediction point. The predictions of the four different models are repeated 100 times, as shown in Fig. 4.4b. It is noted that, with the help of the BFSC training scheme, the sequence prediction performance of the ELM and ESNs is significantly improved.

In addition, the simulation results of these four models with different hidden node numbers are presented in Table 4.1. The RMSE is the sequence-averaged value over 100 repeated predictions for each model. The standard deviation (STD) is also calculated for each hidden node setting. The backward convolution number is fixed at 200 during the simulations for Table 4.1. It is noted that, for each hidden node setting, the BFSC-based models perform much better than the original ones in terms of both accuracy and robustness.

Remark 4.4.1. When the hidden node number or the reservoir size is small, the randomly trained models are unable to provide good modelling performance. However, by using the BFSC training, the models achieve much better performance than the original ones. It suggests that the output backward stochastic convolution helps the supervised learning process for the sequence prediction problem. Furthermore, the training time cost of the BFSC-based models is similar to that of the original ones, as shown in Fig. 4.5. A very efficient training process is achieved for the four-layer neural network shown in Fig. 4.3b.

Figure 4.5: The training time costs with different hidden node numbers for the Mackey-Glass sequence predictions (100 trials).

4.4.2 Turbofan Engine System Prognostic

In this section, the developed models are applied to the turbofan engine system prognostic problem. The turbofan engine units' run-to-failure dataset was collected from NASA's C-MAPSS simulation environment [158]. The dataset FD001 is used in this study, which contains data from 100 turbofan engine units simulated under different system operating conditions. The major input measurements include temperature and pressure at different positions of the turbines, fan speed, and the rate of fuel flow. Details of the dataset description can be found in [157].

Based on the run-to-failure sensory dataset, we can easily evaluate the health index (HI) by using an ELM model formulated as follows [181],

y_t = Σ_{i=1}^{N} β_i g(w_i u_t + b_i)    (4.45)


Figure 4.6: Visualization of the NASA turbofan engine sensory measurements and the corresponding evaluated system HI sequence.

where u_t is the input vector consisting of multiple sensory sequences, and y_t is the corresponding system HI at time t. In our simulation, the hidden node number is selected as 1000 to perform the HI estimation. The model is trained on the initial and terminal sensory measurements, which are labelled with HI values 1 and 0, respectively. After the training, the whole HI sequences can be generated by feeding the whole sensory dataset as the input. An example of the generated HI sequence is shown in Fig. 4.6. Although the estimated HI is very noisy, it clearly shows the nonlinear system deterioration process. Thus, this generated HI dataset is used to validate our proposed training scheme. The turbofan engine system prognostic then becomes a nonlinear dynamical system modelling problem, as formulated in (4.1).
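A minimal sketch of this HI evaluation (4.45), assuming NumPy and a ridge-regularized least-squares fit of the output weights (the regularization term and all names are illustrative assumptions), is given below; the model is fitted only on the initial and terminal measurements and then applied to the full sensory record.

```python
import numpy as np

def elm_health_index(U_train, hi_train, U_all, N=1000, lam=1e-3, seed=0):
    """Sketch of (4.45): random-feature ELM mapping sensory vectors to a scalar HI."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (N, U_train.shape[0]))
    b = rng.uniform(-1, 1, (N, 1))
    phi = lambda U: np.tanh(W @ U + b)                  # hidden features g(w_i u_t + b_i)
    H = phi(U_train)
    beta = hi_train @ H.T @ np.linalg.inv(H @ H.T + lam * np.eye(N))   # output weights
    return beta @ phi(U_all)                            # estimated HI sequence for all samples
```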

The ELM with NARX structure, the ESNs, the BFSC-ELM, and the BFSC-ESNs models are used to predict the future HI sequence based on the sensory sequences and the HI feedback sequence. Following the discussion in Remark 4.2.1, in all the used models, the length of each input sensory sequence and of the HI feedback is selected as 20. The length of the output sequence is chosen as 10. It is noted that the sequence with 10 data points is considered as one dynamical system state. All the applied models are given the same system information to solve a second-order dynamical system modelling problem.

The predictions by the developed BFSC-ESNs are illustrated in Fig. 4.7. It shows that the prediction results are very close to the desired HI sequences at different engine development stages. The small prediction variation over repeated trials suggests that the developed BFSC-ESNs achieves very good modelling robustness under different random initializations.

Similar to the Mackey-Glass prediction case, we further investigate the modelling performance with different hidden node numbers for this engine system prognostic problem. As shown in Fig. 4.8 and Table 4.2, the ELM model with NARX structure achieves similar performance to the ESNs. However, the BFSC-based models achieve much better accuracy in terms of RMSE compared to the ELM and ESNs. It needs to be emphasized that the STD becomes much smaller for the BFSC-based models by adding the output backward convolution layer, which suggests that the BFSC training process also improves the robustness of the models against noise.

Remark 4.4.2. According to Table 4.2, the performance of all the models is improved when the hidden node number is increased from 50 to 1000. However, it is noted that the BFSC-based models with network structure 220-300-100-10 perform better than the ELM and ESNs with network structure 220-1000-10. Meanwhile, as shown in Fig. 4.9, the training time cost of the BFSC models is similar to that of the original ELM and ESNs with the same hidden size, which suggests that, by adding the output backward convolution, the learning capability of the model is significantly improved. The modelling efficiency is also enhanced by the backward knowledge-explaining process.


Figure 4.7: Examples of turbofan engine system HI sequence predictions at different system stages by BFSC-ESNs with network structure 220-400-100-10 (50 repeated trials): (a) initial system stage; (b) intermediate system stage; (c) failure stage.


Figure 4.8: Training and testing RMSE of four stochastically trained models for the NASA turbofan engine prognostic problem, under different hidden node number settings.

Table 4.2: Averaged testing results of four different models on NASA turbofan engine data with 100 repeated trials (the backward convolution number is set as 100 for the BFSC-based models in this test). All values are ×10⁻².

Hidden Node   ELM             BFSC-ELM        ESNs            BFSC-ESNs
Number        RMSE    STD     RMSE    STD     RMSE    STD     RMSE    STD
50            8.07    0.619   6.15    0.266   7.81    0.592   5.92    0.529
100           6.61    0.335   5.23    0.142   6.35    0.348   5.09    0.237
200           5.57    0.194   4.37    0.065   5.33    0.168   4.34    0.137
300           5.00    0.121   4.06    0.036   4.79    0.122   4.04    0.066
400           4.68    0.086   3.97    0.028   4.52    0.078   3.96    0.032
500           4.51    0.074   3.94    0.024   4.37    0.068   3.92    0.023
600           4.39    0.062   3.91    0.021   4.29    0.058   3.90    0.027
700           4.32    0.050   3.90    0.021   4.23    0.045   3.89    0.026
800           4.26    0.050   3.89    0.021   4.19    0.043   3.89    0.026
900           4.23    0.042   3.88    0.023   4.18    0.042   3.88    0.027
1000          4.22    0.046   3.88    0.024   4.18    0.042   3.88    0.027


Figure 4.9: The training time costs of four stochastically trained models with different hidden node numbers for NASA turbofan engine system modelling (100 repeated trials).

4.5 Conclusions

In this chapter, a BFSC-based training scheme has been proposed. The BFSC-ELM and BFSC-ESNs models have been developed for the dynamical modelling problem with sequence output. The developed models have been compared to the ELM with NARX structure and the ESNs on two dynamical system modelling cases: Mackey-Glass sequence prediction and turbofan engine system prognostics. The simulation results have shown that, by introducing the backward stochastic convolution process to the neural network learning, the modelling capability and efficiency have been significantly improved for the sequence output problem. For future work, the authors will further investigate information extraction mechanisms inspired by how knowledge is explained and absorbed in human daily life.


Chapter 5

Neural Network Learning-based

Sinusoidal Structured Modelling

for Gear Tooth Cracking Process

In this chapter, a neural network learning based sinusoidal structured modelling scheme is developed for gear tooth cracking process modelling and prognostics. According to the gear mesh vibration signal characteristics, a sinusoidal structured model is first proposed. The signal model is constructed as a representation of the gear mesh vibration signal that consists of three parts: the harmonic signal part, which expresses the vibration caused by normal gear operation; the residual periodic signal part, which is caused by equipment wear and affected by the system load; and the residual non-periodic part, which is induced by system disturbances and the gear tooth crack. Inspired by the learning process of neural networks, the structured signal model is trained by using the Adam stochastic gradient descent based BP. The developed sinusoidal structured learning scheme is applied to gear tooth cracking process modelling with a set of gear vibration signal data. The excellent performance of the developed method indicates that the different parts of the vibration signal are decomposed, and the gear tooth cracking process can be successfully represented by the energy distribution of the extracted residual signal.

5.1 Introduction

System monitoring and prognostics of industrial systems are essential for im-

proving system safety and availability. As one of the most common elements in

mechatronic systems, the gear plays a crucial role during the system operation.

According to the literature, the vibration sensory data analysis based strate-

gies are normally applied to detect the health conditions of the gear. However,

the vibration signals detected from the operating machine are always complex.

The useful system information is mixed with the noise, disturbances, and non-

stationary system components induced by the operation environment, which mo-

tivates us to develop effective and intelligent approaches for gear fault diagnosis

and prognostics.

During the last decade, gear fault diagnosis and prognostics have attracted a lot of attention from researchers and engineers [161, 170, 191]. Many signal processing based techniques have been developed for gear vibration signal modelling and fault detection, e.g. synchronous signal averaging (SSA) [192], time-frequency analysis [193], the resonance demodulation technique [194, 195], the wavelet transform [196], and spectral kurtosis analysis [197, 198]. These signal processing methods are designed with feature extraction schemes to capture the useful information from the measured vibration signals for gear diagnosis. An optimal sinusoidal modelling approach with amplitude and phase modulations is developed for gear diagnosis and prognostics in [168]. With the developed optimization-based sinusoidal modelling scheme, the crack-induced impulse vibration signal was extracted, and the gear tooth cracking process was successfully monitored.

In addition, with the rapid development of machine learning and artificial neural networks (ANNs) [65, 199, 200], many ANN-based approaches have been employed for fault diagnosis and prognostics [201, 202]. In [11], a dynamical neural network with self-feedback is developed for modelling the crack growth process of ductile alloys. It is noted that, due to their strong non-linear approximation characteristics, ANNs can help to discover important features for fault diagnosis from noisy sensory data. However, ANN-based methods are commonly developed as a 'black box', without much consideration of the physical system and signal properties. In many cases, the signal characteristics are very important for system state evaluation and fault detection. In addition to the data-fitting-based methods, signal properties and model structures are worth investigating further to obtain better modelling performance.

In this chapter, a neural network learning based vibration signal modelling scheme is developed for gear tooth cracking prognostics. Different from the modulated sinusoidal model proposed in [168], the signal model in this chapter is constructed as a superposition of three different terms, which are designed based on the characteristics of the gear vibration signal. The developed method also differs from the residual signal analysis in [161]: after removing the harmonics, the residual signal is investigated further here. The Fourier series is used to model the periodic modulation signals which are caused by equipment wear and loading changes, and the non-periodic signal caused by the gear crack is captured after extracting all the periodic vibration signals. In addition, inspired by the neural network learning process [76, 203], the back-propagation (BP) method is applied for the model optimization with the Adam stochastic gradient descent [83]. Furthermore, the developed scheme is validated by using a set of gear rig test data [204].

The rest of this chapter is arranged as follows. In Section 5.2, a gear mesh vibration signal model is first introduced. Then, a sinusoidal structured signal model is constructed as a superposition of three parts: the main vibration signal harmonics, the residual periodic signal, and the residual non-periodic signal. In Section 5.3, the neural network learning based modelling process is presented. In Section 5.4, the developed neural network learning based signal modelling approach is tested with a gear vibration signal dataset. Section 5.5 concludes the chapter.

5.2 Problem Formulation

The gear mesh vibration signal normally consists of several signal frequency com-

ponents which are induced by different system mechanisms. Thus, it is challeng-

ing to capture the features which reflect the gear tooth cracking. In this section,

the gear mesh vibration signal modelling is introduced, followed by discussions

of an optimal sinusoidal modelling scheme. Based on that, a simple sinusoidal

structured signal model is constructed for the gear mesh vibration signal.

5.2.1 Gear Mesh Vibration Signal Modelling

The gear mesh vibration signal is mainly dominated by several harmonic frequency components, which contain information about operating conditions, states of mechanical elements, vibration-related system faults, and environmental disturbances. According to [160], a gear vibration signal can be modelled as a superposition of a Fourier series and a residual term, as follows,

y_t = Σ_{i=1}^{N} [ A_i cos(2πf_i t) + B_i sin(2πf_i t) ] + δ_t    (5.1)

where i is the meshing harmonic index (i = 1, 2, ..., N), and A_i and B_i are the amplitudes of the sinusoidal components of the ith harmonic frequency f_i. As discussed in [168], the frequencies of the gear vibration signal harmonics can be determined by the gear tooth number. More importantly, δ_t is the residual error term, which represents all the residual vibration signal apart from the harmonic part.

Remark 5.2.1. The residual term δ_t is regarded as a white noise signal in [160], which could be inappropriate in many cases. In addition to the periodic signal part modelled by the Fourier series in (5.1), there could be non-periodic vibration signals, e.g. the impulse vibration signal induced by a gear crack. The residual signal may therefore contain very important information reflecting the gear wear and cracking process. Thus, in this chapter, the residual signal is studied further.

We can rewrite (5.1) by using trigonometric identities as follows,

y_t = Σ_{i=1}^{N} [ C_i cos(2πf_i t + β_i) ] + δ_t    (5.2)

where the amplitude C_i and the phase β_i are defined as

C_i = √(A_i² + B_i²)    (5.3)

and

β_i = arctan(−B_i/A_i)    (5.4)

However, as discussed in [168], the amplitude and phase of each signal frequency component may change or be modulated with the evolution of the system. Thus, for the vibration signal modelling, an amplitude-modulation term c_{it} and a phase-modulation term θ_{it} can be introduced to the signal model, and the vibration signal in (5.2) with modulations can be written as

y_t = Σ_{i=1}^{N} C_i [1 + c_{it}] cos(2πf_i t + β_i + θ_{it}) + δ_t    (5.5)

where

c_{it} = Σ_{j=1}^{p} [ c^{(1)}_{ij} cos(2πj f_s t) + c^{(2)}_{ij} sin(2πj f_s t) ]    (5.6)

θ_{it} = Σ_{j=1}^{p} [ θ^{(1)}_{ij} cos(2πj f_s t) + θ^{(2)}_{ij} sin(2πj f_s t) ]    (5.7)

and

δ_t = d_t Σ_{k=1}^{q} [ δ^{(1)}_k cos(2πk f_s t) + δ^{(2)}_k sin(2πk f_s t) ]    (5.8)

where c^{(1)}_{ij}, c^{(2)}_{ij}, θ^{(1)}_{ij}, θ^{(2)}_{ij}, δ^{(1)}_k, and δ^{(2)}_k, with i ∈ {1, 2, ..., N}, j ∈ {1, 2, ..., p}, and k ∈ {1, 2, ..., q}, are constant model parameters, f_s is the gear-shaft rotation frequency, and d_t is the envelope function of the residual signal δ_t. For vibration induced by a localized gear fault, d_t can be designed in the form of a Gaussian function as

d_t = d_0 e^{−(t−µ)²/σ²}    (5.9)

where d_0, µ, and σ are constant coefficients which represent the height of the function, the centre position of the peak, and the standard deviation, respectively. It is noted that both the amplitude and phase modulations are assumed to be periodic, and the envelope of the residual signal is assumed to be Gaussian. Both assumptions are based on prior knowledge and on the observation of the gear vibration signal shown in Fig. 5.1. Then, based on the signal modelling from (5.5) to (5.9), we have the following gear vibration signal model.


y_t = Σ_{i=1}^{N} C_i [ 1 + Σ_{j=1}^{p} ( c^{(1)}_{ij} cos(2πj f_s t) + c^{(2)}_{ij} sin(2πj f_s t) ) ] cos( 2πf_i t + β_i + Σ_{j=1}^{p} ( θ^{(1)}_{ij} cos(2πj f_s t) + θ^{(2)}_{ij} sin(2πj f_s t) ) ) + d_0 e^{−(t−µ)²/σ²} Σ_{k=1}^{q} ( δ^{(1)}_k cos(2πk f_s t) + δ^{(2)}_k sin(2πk f_s t) )    (5.10)

Remark 5.2.2. For gear vibration signals with different faults and characteristics, the model in (5.5) might not be suitable. Furthermore, with the modulations of amplitude and phase, the complete model presented in (5.10) is complicated. To obtain the solution effectively, some assumptions are used to simplify the model [168], which may affect the signal modelling performance. All of these considerations motivate us to develop a simple and efficient scheme for gear mesh vibration signal modelling.

5.2.2 Sinusoidal Structured Signal Model

Three gear vibration signal samples from different gear cracking development stages and the corresponding frequency spectra are shown in Fig. 5.1. The signals were collected from a gear test rig operated by Swinburne University of Technology and the Australian Defence Science and Technology group (DSTO). In Fig. 5.1(a), the signal at G6b.30 is approximately periodic. With the system development and gear wear, the signal is significantly changed at G6b.100, as shown in Fig. 5.1(c). More obviously, a non-periodic signal is observed around 200 degrees of shaft angle at G6b.110, as shown in Fig. 5.1(e); this is when the crack was first detected. In contrast, only small differences can be observed from the frequency spectra, as shown in Fig. 5.1 (b), (d), and (f).

Figure 5.1: The synchronous averaged signals y_t at (a) G6b.30; (c) G6b.100; (e) G6b.110; and the frequency spectra at (b) G6b.30; (d) G6b.100; (f) G6b.110.

Different from the specially designed signal model in (5.10), in this chapter the following signal model is constructed.

y_t = h_t + r_t + ε_t    (5.11)

where the signal y_t is designed to consist of three parts: the harmonic frequency part h_t, the residual periodic part r_t, and the second residual error term ε_t.


Remark 5.2.3. The residual error term δ_t in (5.5) is investigated according to the gear mesh vibration signal characteristics. As addressed in [168, 170], the envelope of the gear-crack-induced impulse vibration signal could be modelled by a log-normal function or a Gaussian function, which is non-periodic. However, for an unknown impulse signal, it is hard to obtain the appropriate formulation. Thus, in this chapter, we divide the residual signal into two parts: one part can be structured in a periodic form, and the final residual part is left as the unknown non-periodic signal.

Similar to (5.5), the harmonic frequency part h_t is designed by the Fourier series as

h_t = Σ_{i=1}^{N} [ C_i cos(2πf_i t + β_i) ]    (5.12)

where C_i and β_i represent the amplitude and phase of the ith harmonic frequency component, N is the order number, which is pre-defined according to the frequency spectrum, and f_i is the ith gear mesh frequency, which can be determined by the gear-shaft rotational frequency f_s and the gear tooth number N_e as

f_i = i · N_e · f_s    (5.13)

In addition to the harmonic frequency part, the residual signal δ_t in (5.5) is divided into two parts, r_t and ε_t. The periodic residual part r_t is defined as

r_t = Σ_{j=1}^{p} [ a_j cos(2πf_{sj} t) + b_j sin(2πf_{sj} t) ]    (5.14)

where j ∈ {1, 2, ..., p} is the frequency index of the residual periodic signal, and a_j and b_j are the amplitudes of the jth residual sinusoidal component with frequency f_{sj}, which can be derived by performing the Fourier transform on the residual signal δ_t.


It is noted that the second residual error term ε_t is designed for the non-periodic signal. For the gear mesh vibration signal, the non-periodic vibration is normally caused by faults such as a gear crack. In this chapter, we assume that the characteristics of the vibration signal induced by the gear crack are unknown. During the modelling, given the measured gear vibration data y_t, the non-periodic residual signal can simply be obtained as

ε_t = y_t − ĥ_t − r̂_t    (5.15)

where ĥ_t and r̂_t are the estimated harmonic part and the estimated periodic residual part, which can be obtained by a learning procedure.

In summary, the complete structured gear vibration signal model over one complete revolution can be formulated as

y_t = Σ_{i=1}^{N} [ C_i cos(2πf_i t + β_i) ] + Σ_{j=1}^{p} [ a_j cos(2πf_{sj} t) + b_j sin(2πf_{sj} t) ] + ε_t    (5.16)

where the gear mesh vibration signal model is structured with three parts (a small synthetic sketch of this construction is given below). The first part is the harmonic frequency part h_t, which is the dominant part of the gear vibration signal at the initial healthy stage of the gear; the harmonic frequencies are determined by the shaft-rotation frequency and the gear tooth number, and this part is assumed to be pre-known and periodic. The second part is structured as the residual periodic signal, which may be induced by gear wear and by unknown effects from the operating environment; the frequencies of the residual periodic signal are normally unknown and need to be extracted from the residual term δ_t. The last part is the non-periodic part, which is considered as the vibration caused by gear cracks and damage.
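For illustration, the Python sketch below (assuming NumPy; all coefficient arrays are placeholders) synthesizes a signal with the three-part structure of (5.16), which can be useful for sanity-checking the decomposition procedure of Section 5.3.

```python
import numpy as np

def structured_signal(t, C, beta, f, a, b, f_res, eps):
    """Sketch of (5.16): harmonic part + residual periodic part + non-periodic residual."""
    h = sum(Ci * np.cos(2 * np.pi * fi * t + bi)
            for Ci, bi, fi in zip(C, beta, f))                  # h_t, eq. (5.12)
    r = sum(aj * np.cos(2 * np.pi * fj * t) + bj * np.sin(2 * np.pi * fj * t)
            for aj, bj, fj in zip(a, b, f_res))                 # r_t, eq. (5.14)
    return h + r + eps                                          # y_t = h_t + r_t + eps_t
```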

Remark 5.2.4. Different from the gear mesh vibration model formulated in (5.10), which is designed with amplitude and phase modulations [168], the model (5.16) is straightforwardly structured with three decomposed signal parts, which cover the known system knowledge h_t, the unknown but observable signal r_t, and the unknown, hardly observable information ε_t. Since the signal characteristics of the non-periodic, crack-induced signal part differ from those of the other parts of the vibration signal, it can be captured in the final residual error term ε_t by appropriately designing the structured learning process. The neural network learning based training scheme for the vibration signal modelling is discussed in the following section.

5.3 The Neural Network Learning-based Structured Sinusoidal Modelling

In this section, the neural network learning based sinusoidal structured modelling

procedure of the model in (5.16) is presented. Inspired by the neural network

learning [76], the Adam stochastic gradient descent based BP method is applied

for the signal modelling process.

As shown in (5.16), with the predefined architecture, the model is designed to store three different kinds of knowledge from the vibration signals. Assume that the vibration signal of one gear rotation revolution at time period k is recorded as y_k = [y_{k1}, y_{k2}, ..., y_{kT}], where T is the number of samples in one complete gear mesh revolution. Then, the objective of the modelling task can be formulated as

min { (γ/2) ‖ε_k‖² + (d/2) D(C, β, a, b) }
s.t.  ε_k = y_k − ŷ_k    (5.17)

where D(·) is a regularization function, C = [C_1, C_2, ..., C_N] is the harmonic amplitude parameter vector, β = [β_1, β_2, ..., β_N] is the harmonic phase parameter vector, a = [a_1, a_2, ..., a_p] and b = [b_1, b_2, ..., b_p] are the constant coefficients of the residual periodic part of the model, and ŷ_k = [ŷ_{k1}, ŷ_{k2}, ..., ŷ_{kT}] is the output vector of the model defined in (5.16), with t ∈ {1, 2, ..., T}.

However, it is noted that the model designed in (5.16) is still complex, which makes it difficult to reach the optimal solution directly. In this section, we investigate a step-by-step structured learning process. Instead of solving the problem in (5.17) directly, the harmonic part of the model, h_t, is first used to capture the normal operational information from the measured vibration signal sequence y_k = [y_{k1}, y_{k2}, ..., y_{kT}]:

min { (γ_1/2) ‖ε_{hk}‖² + (d_1/2) ‖C‖² + (d_2/2) ‖β‖² }
s.t.  ε_{hk} = y_k − h_k    (5.18)

where h_k = [h_{k1}, h_{k2}, ..., h_{kT}], h_{kτ} is the output of the harmonic component model in (5.12) at t = τ, and d_1 and d_2 are the regularization parameters.

Inspired by neural network learning, the Adam stochastic gradient descent based BP is applied to efficiently obtain the optimal solution of (5.18). The process is presented as follows.

1. The parameter vectors C_0 and β_0 are randomly initialized within a small range. The first moment terms v_{c0} = [v_{c0,1}, v_{c0,2}, ..., v_{c0,N}] and v_{β0} = [v_{β0,1}, v_{β0,2}, ..., v_{β0,N}], and the second moment terms m_{c0} = [m_{c0,1}, m_{c0,2}, ..., m_{c0,N}] and m_{β0} = [m_{β0,1}, m_{β0,2}, ..., m_{β0,N}], are all initialized as zero vectors. The step parameters for the moment terms are selected as α_{v1} = α_{v2} = 0.9 and α_{m1} = α_{m2} = 0.999. The iteration index s is initialized with 1.

2. Randomly select one training sample y_{kt}, t ∈ {1, 2, ..., T}, and compute the corresponding model output h_{kt} at iteration s as

h_{kt} = Σ_{i=1}^{N} [ C_{si} cos(2πf_i t + β_{si}) ]    (5.19)

3. Compute the regularized cost function value as

L_1(C; β) = (γ_1/2) (y_{kt} − h_{kt})² + (d_1/2) Σ_{i=1}^{N} C²_{si} + (d_2/2) Σ_{i=1}^{N} β²_{si}    (5.20)

4. Compute the derivatives of the objective function with respect to C_{si} and β_{si}, i ∈ {1, 2, ..., N}, at iteration s:

∂L_1/∂C_{si} = −γ_1 (y_{kt} − h_{kt}) · cos(2πf_i t + β_{si}) + d_1 C_{si}    (5.21)

∂L_1/∂β_{si} = γ_1 (y_{kt} − h_{kt}) · C_{si} · sin(2πf_i t + β_{si}) + d_2 β_{si}    (5.22)

5. Update the first moment terms at iteration s,

v_{csi} = α_{v1} · v_{c(s−1)i} + (1 − α_{v1}) ∂L_1/∂C_{si}    (5.23)

v_{βsi} = α_{v2} · v_{β(s−1)i} + (1 − α_{v2}) ∂L_1/∂β_{si}    (5.24)

and the corresponding bias-corrected first moment terms,

v̂_{csi} = v_{csi} / (1 − α_{v1}^s)    (5.25)

v̂_{βsi} = v_{βsi} / (1 − α_{v2}^s)    (5.26)

6. Update the second moment terms at iteration s,

m_{csi} = α_{m1} · m_{c(s−1)i} + (1 − α_{m1}) ( ∂L_1/∂C_{si} )²    (5.27)

m_{βsi} = α_{m2} · m_{β(s−1)i} + (1 − α_{m2}) ( ∂L_1/∂β_{si} )²    (5.28)

and the corresponding bias-corrected second moment terms,

m̂_{csi} = m_{csi} / (1 − α_{m1}^s)    (5.29)

m̂_{βsi} = m_{βsi} / (1 − α_{m2}^s)    (5.30)

7. Update the parameters as

C_{si} = C_{(s−1)i} − η_1 · v̂_{csi} / ( √(m̂_{csi}) + ε )    (5.31)

β_{si} = β_{(s−1)i} − η_2 · v̂_{βsi} / ( √(m̂_{βsi}) + ε )    (5.32)

where η_1 and η_2 are the learning step parameters, and ε is a small constant (e.g. 10⁻⁸) introduced to prevent division by zero.

8. Update the iteration index s ← s + 1, and repeat steps 2-7 until the cost function in (5.20) converges to a small value.

After the cost function in (5.20) has converged, with the optimized parameters C = [C_1, C_2, ..., C_N] and β = [β_1, β_2, ..., β_N], we can obtain the approximated harmonic signal ĥ_k = [ĥ_{k1}, ĥ_{k2}, ..., ĥ_{kT}], where the tth output is obtained as

ĥ_{kt} = Σ_{i=1}^{N} [ C_i cos(2πf_i t + β_i) ]    (5.33)

Remark 5.3.1. The harmonic component part of the model is used first to perform the neural network learning process. It is noted that, with the pre-known harmonic frequencies, the harmonic model aims at capturing the vibration information of normal gear operation. Since the model is structured with periodic sinusoids at pre-defined frequencies, the non-periodic signals and the periodic signals at other frequencies will be left over in the residual (a compact sketch of this Adam-based fitting step is given below).
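The following Python sketch, assuming NumPy, condenses steps 1-8 (eqs. 5.19-5.32) into a routine that fits the harmonic amplitudes and phases for a set of known frequencies; the frequency units (cycles per sample), hyper-parameter defaults, and all names are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def fit_harmonics_adam(y, f, n_iter=20000, eta=1e-3, g1=1.0, d1=1e-4, d2=1e-4,
                       a_v=0.9, a_m=0.999, eps=1e-8, seed=0):
    """Sketch of the Adam-based BP fitting of C_i and beta_i in (5.12).
    y: one revolution of the signal; f: NumPy array of known frequencies (cycles/sample)."""
    rng = np.random.default_rng(seed)
    T, N = len(y), len(f)
    C, B = rng.uniform(-0.1, 0.1, N), rng.uniform(-0.1, 0.1, N)
    vC, vB, mC, mB = (np.zeros(N) for _ in range(4))
    for s in range(1, n_iter + 1):
        t = rng.integers(T)                                    # step 2: random sample
        phase = 2 * np.pi * f * t + B
        h = np.sum(C * np.cos(phase))                          # eq. (5.19)
        gC = -g1 * (y[t] - h) * np.cos(phase) + d1 * C         # eq. (5.21)
        gB = g1 * (y[t] - h) * C * np.sin(phase) + d2 * B      # eq. (5.22)
        vC, vB = a_v * vC + (1 - a_v) * gC, a_v * vB + (1 - a_v) * gB          # first moments
        mC, mB = a_m * mC + (1 - a_m) * gC**2, a_m * mB + (1 - a_m) * gB**2    # second moments
        vC_h, vB_h = vC / (1 - a_v**s), vB / (1 - a_v**s)      # bias corrections
        mC_h, mB_h = mC / (1 - a_m**s), mB / (1 - a_m**s)
        C = C - eta * vC_h / (np.sqrt(mC_h) + eps)             # eq. (5.31)
        B = B - eta * vB_h / (np.sqrt(mB_h) + eps)             # eq. (5.32)
    return C, B
```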

Then the first residual error vector δ_k = [δ_{k1}, δ_{k2}, ..., δ_{kT}] at revolution time period k can be obtained, where the tth residual term of the kth revolution period is derived as

δ_{kt} = y_{kt} − ĥ_{kt}    (5.34)

The further modelling of the residual error term is carried out with a similar learning approach. The residual harmonic model designed in (5.39) is employed to extract the residual periodic information from δ_k by an approximation process. The optimization problem can then be formulated as

min { (γ_2/2) ‖ε_{rk}‖² + (d_3/2) ‖a‖² + (d_4/2) ‖b‖² }
s.t.  ε_{rk} = δ_k − r_k    (5.35)

where r_k = [r_{k1}, r_{k2}, ..., r_{kT}] is the output vector of the model in (5.39) at revolution time period k, and d_3 and d_4 are the regularization parameters. The objective function at iteration s, with a randomly selected sample δ_{kt}, can be calculated as

L_2(a; b) = (γ_2/2) (δ_{kt} − r_{kt})² + (d_3/2) Σ_{j=1}^{p} a²_{sj} + (d_4/2) Σ_{j=1}^{p} b²_{sj}    (5.36)

and the derivatives at iteration s are obtained as

∂L_2/∂a_{sj} = −γ_2 (δ_{kt} − r_{kt}) · cos(2πf_{sj} t) + d_3 a_{sj}    (5.37)

∂L_2/∂b_{sj} = −γ_2 (δ_{kt} − r_{kt}) · sin(2πf_{sj} t) + d_4 b_{sj}    (5.38)

Similar to the learning process presented in (5.19)-(5.32), the Adam stochastic gradient descent based BP is applied to obtain the optimal solution a = [a_1, a_2, ..., a_p] and b = [b_1, b_2, ..., b_p] for the problem in (5.35). The estimated residual periodic signal can be obtained as r̂_k = [r̂_{k1}, r̂_{k2}, ..., r̂_{kT}], where the tth term is derived as

r̂_{kt} = Σ_{j=1}^{p} [ a_j cos(2πf_{sj} t) + b_j sin(2πf_{sj} t) ]    (5.39)

Then, the final residual error term can be obtained as

ε_k = δ_k − r̂_k    (5.40)

Remark 5.3.2. With the neural network learning based modelling process, the structured model is trained to extract the different vibration signal components step by step. The harmonic frequency part is derived first, with the pre-known frequencies, by using the BP training approach with Adam stochastic gradient descent. Then, the residual periodic signal part is extracted by a similar learning method. As mentioned in Remark 5.2.4, because the characteristics of the gear-crack-induced vibration signal differ from those of the normal healthy gear operational vibration, the complex signal can be decomposed with the developed neural network learning based modelling scheme; a compact sketch of this residual decomposition is given below.
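The sketch below, assuming NumPy and reusing the illustrative fit_harmonics_adam routine above, chains the two fitting stages for one revolution and returns the non-periodic residual together with its energy; for brevity the residual periodic part is fitted here in amplitude-phase form, which is equivalent to the cos/sin form of (5.14).

```python
import numpy as np

def decompose_revolution(y, f_harm, f_res, **adam_kw):
    """Sketch of (5.34)-(5.40): harmonic fit, residual periodic fit, non-periodic residual."""
    t = np.arange(len(y))
    C, B = fit_harmonics_adam(y, f_harm, **adam_kw)                    # fit the harmonic part h_t
    h_hat = np.sum(C[:, None] * np.cos(2 * np.pi * np.outer(f_harm, t) + B[:, None]), axis=0)
    delta = y - h_hat                                                  # first residual, eq. (5.34)
    a, b = fit_harmonics_adam(delta, f_res, **adam_kw)                 # fit r_t on the residual
    r_hat = np.sum(a[:, None] * np.cos(2 * np.pi * np.outer(f_res, t) + b[:, None]), axis=0)
    eps = delta - r_hat                                                # non-periodic residual, eq. (5.40)
    return eps, float(np.sum(eps ** 2))                                # residual and its energy
```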

5.4 Performance Investigation

The neural network learning based signal modelling scheme is employed for gear vibration signal modelling and tooth crack evolution monitoring. The gear test dataset used in this section was collected by Swinburne University of Technology and the Australian DSTO [204]. The details of the gear test equipment configuration can be found in [161, 168]. As discussed in [205], there are three major factors which have a significant impact on the evolution of the gear vibration signal: (1) changes of the gear equipment load; (2) the gear crack and wear development; (3) changes due to dismantlement and re-installation of the test gear for equipment inspections. It is noted that the first two factors may significantly affect the signal; they are also investigated in this section.

We first implement the developed modelling approach based on the model in (5.12). With the recursive training process (5.19)-(5.32), the error convergence of the learning process at different gear operation stages is presented in Fig. 5.2.

Remark 5.4.1. With the Adam stochastic gradient descent, the developed learning process is able to converge to a small value at different gear evolution time periods. The mean square error (MSE) in Fig. 5.2 is calculated based on the first residual error δ_t. It is noted that, at stage G6b.110, the learning process converged to a much larger value compared to G6b.30 and G6b.100. Because the gear crack occurred at G6b.110, the induced non-periodic signal could not be captured by the structured harmonic model in (5.12) and mostly remained in the residual term.

In addition, the signal modelling performance of the harmonic model is presented in Fig. 5.3. With the known harmonic frequencies, the model is trained to collect the normal gear operational information. The other parts of the signal, e.g. the non-periodic signal and the unknown periodic signals, are left in the residual.

Remark 5.4.2. As shown in Fig. 5.3(a), the harmonic component model h_t with constant amplitude and phase achieves excellent approximation performance at the initial healthy stage of the gear. As presented in Fig. 5.3(b), the residual signal at G6b.30 is roughly periodic. However, when a crack occurs, the harmonic model performs poorly in the signal approximation, as presented in Fig. 5.3(c), and a much larger residual error is obtained, as illustrated in Fig. 5.3(d). This is not only due to the gear-crack-induced impulse signal, but also results from residual harmonic frequency components caused by gear wear and the changes of the system load.

Figure 5.2: The convergence of the developed stochastic learning process with the harmonic signal model for the signal at different revolution time periods.

After the harmonic frequency components have been extracted, the first residual signal can be obtained as in (5.34). Using the Fourier transform, the frequency spectra of the residual signals δ_t at different gear evolution periods are presented in Fig. 5.4. Compared to the frequency spectrum of the gear vibration signal at G6b.50, there is an additional periodic signal around shaft order 30 at G6b.100. From the evolution of the signal in terms of the magnitude changes of the frequency band with shaft orders 20-40, shown in Fig. 5.4(d), a rapid energy growth is observed around G6b.64.

To further investigate the residual signal δ_k by solving the problem in (5.35), a similar learning procedure is carried out. After eliminating the residual periodic signal, the non-periodic residual signal is obtained as in (5.40). Its energy distribution from G6b.50 to G6b.110 is shown in Fig. 5.5(c).


Figure 5.3: Signal approximation performance using the harmonic model h_t: (a) signal approximation at G6b.30; (b) first residual signal extracted at G6b.30; (c) signal approximation at G6b.110; (d) first residual signal extracted at G6b.110.

Remark 5.4.3. The energy distribution of the residual signal δ_t derived from G6b.50 to G6b.110 is presented in Fig. 5.5(a), which is similar to the results obtained in [170]. However, a large step jump is observed at G6b.64, when the system load was changed from 30 kW to 45 kW, as shown in Fig. 5.5(b), which indicates that the system load can have a significant impact on the gear vibration. In contrast, after further investigation of the residual signal, the impact of the system load is removed once the residual harmonics are eliminated. As shown in Fig. 5.5(c), the energy distribution of the second residual signal ε_t shows a very clear trend of gear crack propagation with an exponential development process, which suggests that the gear cracking process is successfully monitored.

Figure 5.4: The frequency spectra of the residual signal δ_t at (a) G6b.50; (b) G6b.100; (c) G6b.110; and (d) the gear evolution process from G6b.50 to G6b.110, in terms of the magnitude summation of the frequency band with shaft orders 20-40.


Figure 5.5: (a) Energy distribution of the first residual signal δ_t extracted from G6b.50 to G6b.110; (b) the system load from G6b.50 to G6b.110; (c) energy distribution of the second residual signal ε_t extracted from G6b.50 to G6b.110.


5.5 Conclusions

In this chapter, a neural network learning based sinusoidal structured modelling approach has been developed for gear mesh vibration signal modelling. According to the different properties of the gear vibration and the crack-induced signal, the developed signal model has been constructed with three parts: the harmonic frequency components, the residual harmonic signal induced by equipment wear and changes of the system load, and the non-periodic residual signal which is mainly induced by the crack. Inspired by the neural network learning process, the Adam stochastic gradient descent based BP has been applied for the training of the structured signal model. The simulation results have shown that the structured learning approach can successfully process the vibration signal and capture the information of gear crack propagation for system monitoring. Future work on different signal decomposition methods, and on the analysis of the frequency distribution of the residual error, is under the authors' investigation.


Chapter 6

Conclusions and Future Work

In this chapter, the contributions of the thesis are summarized. Some future

research work is presented.

6.1 Summary of Contributions

The thesis has focused on developing efficient and robust neural network learning

based schemes for real-time dynamical modelling problems, e.g., system prog-

nostics and predictions. For real-time dynamical system modelling applications,

the employed methods are required to be accurate, efficient, and robust against

system noise. To improve the efficiency of the neural network learning process for

sequence predictions, both the network structures and the training approaches

have been investigated. Furthermore, a neural network learning based sinusoidal

structured modelling approach has been developed for vibration signal process-

ing and gear tooth cracking process modelling. The major contributions of each

chapter are summarized as follows.

In Chapter 3, a NARX network-based approach has been proposed for dynamical system modelling and applied to turbofan engine system prognostics. To effectively capture the dynamical features from the massive sensory data and improve the model robustness against noise, a special feedback structure has been designed with two different kinds of sequential feedback. Furthermore, based on Monte-Carlo stochastic approximation, a much faster training scheme has been developed for the NARX type of neural networks, compared to the conventional BP based training methods. The applications on the NASA simulated turbofan engine dataset have shown that, with the help of the feedback structures and the developed training approach, the dynamical modelling capability of the network has been significantly improved. Compared to the traditional BP trained NARX network, the developed approach has achieved a much faster learning process and excellent performance.

In Chapter 4, a novel neural network learning scheme, called BFSC, has been proposed for dynamical system modelling problems with multi-variate inputs and sequence output. Motivated by the human teaching process, a backward convolution based training scheme has been developed for sequence output problems as a knowledge-explaining procedure. During the training, the input and output feature spaces are generated by forward propagating the input sequences and backward processing the output sequences, respectively. The knowledge learning process is then carried out between the input and output feature spaces. It is noted that, by adding the output feature layer, the modelling capability of the neural network has been significantly improved. The proposed learning scheme has been successfully applied to the ELM and ESNs models. According to the simulations and comparisons of four different models, ELM, ESNs, BFSC-ELM, and BFSC-ESNs, on Mackey-Glass sequence predictions and NASA turbofan engine prognostics, the BFSC-trained ELM and ESNs have demonstrated superior performance compared to the original ELM and ESNs, which suggests that backward information processing could help the neural network learn better.

Chapter 5 has investigated a structured learning approach for vibration signal processing and gear tooth cracking process modelling. Based on the conventional Fourier-series-based sinusoidal modelling scheme, the signal model has been constructed as the superposition of three parts, which represent the harmonic components generated by normal gear operation, the residual periodic signal affected by the system load and equipment wear, and the residual non-periodic part induced by gear tooth cracking and system disturbances. Motivated by neural network learning, BP training has been applied to the learning process of the structured model, and Adam stochastic gradient descent has been employed to obtain the optimal solution. A set of gear rig test vibration data has been used to validate the developed learning scheme. The simulation results have shown that the developed structured learning approach successfully decomposes the vibration signal into different components, and that the gear cracking process is successfully tracked by the energy distribution of the residual non-periodic signal part, which suggests that the developed learning scheme is able to capture the features of gear cracking and provide effective information for gear system prognostics.
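A toy version of such a structured fit, written in PyTorch, is sketched below. The meshing frequency, harmonic orders, residual frequency grid, and the synthetic stand-in signal are assumptions chosen for illustration; only the overall pattern, a parameterised sum of sinusoidal components optimised by Adam through backpropagation with the leftover treated as the non-periodic residual, reflects the approach of Chapter 5.

    import torch

    t = torch.linspace(0.0, 1.0, 2048)
    f_mesh = 40.0                                    # assumed gear meshing frequency (Hz)

    # Trainable amplitudes and phases for the harmonic and residual-periodic parts.
    a = torch.zeros(5, requires_grad=True)
    phi = torch.zeros(5, requires_grad=True)
    b = torch.zeros(8, requires_grad=True)
    psi = torch.zeros(8, requires_grad=True)

    def structured_model(t):
        k = torch.arange(1, 6, dtype=t.dtype)
        harmonic = (a[:, None] * torch.cos(2 * torch.pi * f_mesh * k[:, None] * t + phi[:, None])).sum(0)
        m = torch.arange(1, 9, dtype=t.dtype)
        residual_periodic = (b[:, None] * torch.cos(2 * torch.pi * 2.5 * m[:, None] * t + psi[:, None])).sum(0)
        return harmonic + residual_periodic

    x = torch.sin(2 * torch.pi * f_mesh * t) + 0.05 * torch.randn_like(t)  # stand-in vibration signal

    opt = torch.optim.Adam([a, phi, b, psi], lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        loss = torch.mean((structured_model(t) - x) ** 2)    # BP through the structured model
        loss.backward()
        opt.step()

    residual_nonperiodic = x - structured_model(t).detach()  # its energy tracks tooth cracking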

6.2 Future Work

Based on the work discussed above, some interesting points and research directions are presented in this section.

6.2.1 Investigations on Long Short-term Memory of the

NARX Type of Neural Network

RNNs are powerful tools for dynamical system modelling and sequential data processing. However, it is very hard to train RNNs to learn long-term dependencies from the training data sequences, and gradient vanishing and exploding problems may occur during the training. To address these problems, the long short-term memory (LSTM) technique has been developed with additive gating operations [206]. However, it is noted that capturing long-term dependencies remains very challenging in many cases, even with LSTM [46, 207].

On the other hand, little research attention has been paid to the NARX neural network, which allows direct connections from a sequence of past system states. The limitations of the NARX network are summarized as follows:

1. The capability of the NARX network to learn long-term dependencies is limited by the predefined number of delays.

2. The structural efficiency of the NARX network is low, especially when the number of tapped delays is large; the full connections from the feedback sequence to the current state may not be necessary (see the sketch after this list).

3. Although the NARX network suffers less from the vanishing gradient problem than conventional RNNs during training, the improvement is limited.
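The first two limitations can be made concrete with a small schematic, where all sizes are illustrative assumptions: a NARX update is fully connected to a window of the last d outputs, so its recurrent weight matrix grows linearly with d, and the model can never look back further than d steps.

    import numpy as np

    def narx_step(y_hist, u_t, W_y, W_u, b):
        """One NARX update: fully connected to the last d outputs (most recent first)."""
        return np.tanh(W_y @ y_hist + W_u @ u_t + b)

    d, n_u, n_y = 16, 3, 1
    rng = np.random.default_rng(0)
    W_y = rng.standard_normal((n_y, d * n_y))   # grows linearly with the tapped-delay number d
    W_u = rng.standard_normal((n_y, n_u))
    b = np.zeros(n_y)

    y_hist = np.zeros(d * n_y)                  # anything older than d steps is invisible
    y_next = narx_step(y_hist, rng.standard_normal(n_u), W_y, W_u, b)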

To exploit long-term dependencies based on NARX RNNs, a novel architecture called mixed history RNNs (MIST RNNs) has been developed in [46]. Compared to LSTM, MIST RNNs show much better properties against vanishing gradients. However, investigations into learning long-term dependencies are still ongoing. Compared to conventional RNNs, non-linear sequential feedback architectures could bring many benefits for sequential data processing.


Figure 6.1: Sketch of Newtonian neural computing schemes: (a) the structure of multilayer feed-forward neural networks; (b) a typical network topological structure of the proposed DKNN; (c) an example with multi-sequential inputs and sequence output.

6.2.2 Kinematic Structured Neural Network for Dynamical System State Predictions

Sensory data sequences collected from complicated systems contain important dynamical features for system modelling, prognostics, and predictions. However, driven by operating conditions, noise, and uncertainties, everything in the universe changes over time, and so do the observations and measurements of systems.

The foundations of classical mechanics were built to describe the motions of and forces on objects. However, when dealing with complex objects and systems, the forces and masses are often unmeasurable. At the current level of human cognition, it is very challenging to evaluate and predict the evolution of sophisticated systems [208], e.g. complex mechatronic systems, financial markets, and spacecraft navigation systems.

ANNs are known for their universal approximation capability. Based on the training datasets, an ANN is able to provide a good mathematical representation of the input-output relationship. However, an ANN may learn incorrectly when affected by the noise and disturbances in the training dataset. Furthermore, learning is not equivalent to approximation. Biological neurons have their own structures; the connections between neurons, and even the structure of a neuron itself, may change while we are learning.

It could be a very interesting topic to combine mechanics and computer science. As shown in Fig. 6.1, for dynamical system prognostics and predictions, a kinematic structured neural network could be developed to induce the network to learn the dynamical features in the data. Such novel network structures and training schemes could significantly enhance neural network learning.

6.2.3 A Neural Adaptive Feedback Scheme for Time-varying

Non-linear System Modelling

Time-varying non-linearity is a ubiquitous phenomenon in nature. Many time series measured in real life are non-linear or even chaotic, such as non-linear wireless transmission signals, glucose dynamics, electric loads, rainfall and temperature [209]. Such non-linear dynamics in measurements bring many challenges for system identification, fault diagnosis, and time-series forecasting. Over the past decades, extensive research has been carried out on dynamical system identification and time series prediction to harness the non-linearity and chaos [52, 167, 210].

Figure 6.2: Neural adaptive feedback scheme.

In light of this, a neural adaptive feedback scheme is worth investigating for time-varying non-linear system modelling. A brief idea is shown in Fig. 6.2: based on a linear feedback model with external inputs ε(t), a neural network is applied to adaptively adjust the feedback parameters. Further research can be carried out on the adaptive network structures and training schemes.
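One possible reading of this scheme, with all dimensions, the context window, and the network shape chosen here as illustrative assumptions rather than the design in Fig. 6.2, is a linear feedback model y(t) = a(t)^T [y(t-1), ..., y(t-p)] + ε(t) whose coefficients a(t) are produced by a small neural network from the recent output window, so that the feedback adapts as the system drifts.

    import numpy as np

    p, n_hidden = 4, 16
    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.standard_normal((n_hidden, p))
    W2 = 0.1 * rng.standard_normal((p, n_hidden))

    def adaptive_coeffs(context):
        """Small network mapping the recent output window to time-varying feedback coefficients."""
        return W2 @ np.tanh(W1 @ context)

    def predict_next(y_hist, eps_t):
        a_t = adaptive_coeffs(y_hist)       # feedback parameters adjusted by the network
        return a_t @ y_hist + eps_t         # linear feedback model with external input eps(t)

    y_hist = np.array([0.2, 0.1, -0.05, 0.0])   # y(t-1), ..., y(t-p)
    y_next = predict_next(y_hist, eps_t=0.01)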

In summary, the aim of this research is to develop intelligent neural network learning-based approaches that are able to make accurate predictions and prognostics based on the learned knowledge. In many real-world applications, making predictions is much more challenging than approximation and classification tasks. More ideas and theoretical breakthroughs are required to push current AI to a higher level.


Bibliography

[1] R. Qiu, S. Yuan, J. Xiao, X. D. Chen, C. Selomulya, X. Zhang, and M. W.

Woo, “The effects of edge functional groups on water transport in graphene

oxide membranes,” Applied Materials and Interfaces, 2019. 1

[2] R. Qiu, J. Xiao, and X. D. Chen, “Further understanding of the biased

diffusion for peptide adsorption on uncharged solid surfaces that strongly

interact with water molecules,” Colloids and Surfaces A: Physicochemical

and Engineering Aspects, vol. 518, pp. 197–207, 2017. 1

[3] G. Deco, V. K. Jirsa, and A. R. McIntosh, “Emerging concepts for the dy-

namical organization of resting-state activity in the brain,” Nature Reviews

Neuroscience, vol. 12, no. 1, p. 43, 2011. 1

[4] M. Bando, K. Hasebe, A. Nakayama, A. Shibata, and Y. Sugiyama, “Dy-

namical model of traffic congestion and numerical simulation,” Physical Re-

view E, vol. 51, no. 2, p. 1035, 1995. 1

[5] C. Castellano, S. Fortunato, and V. Loreto, “Statistical physics of social

dynamics,” Reviews of Modern Physics, vol. 81, no. 2, p. 591, 2009. 1

[6] D. J. Earn, P. Rohani, B. M. Bolker, and B. T. Grenfell, “A simple model

for complex dynamical transitions in epidemics,” Science, vol. 287, no. 5453,

pp. 667–670, 2000. 1


[7] P. Arena, L. Fortuna, M. Frasca, and G. Sicurella, “An adaptive, self-

organizing dynamical system for hierarchical control of bio-inspired loco-

motion,” IEEE Transactions on Systems, Man, and Cybernetics, Part B

(Cybernetics), vol. 34, no. 4, pp. 1823–1837, 2004. 1

[8] W. Tabor, C. Juliano, and M. K. Tanenhaus, “Parsing in a dynamical sys-

tem: An attractor-based account of the interaction of lexical and struc-

tural constraints in sentence processing,” Language and Cognitive Processes,

vol. 12, no. 2-3, pp. 211–271, 1997. 1

[9] Y.-H. Pao, G.-H. Park, and D. J. Sobajic, “Learning and generalization

characteristics of the random vector functional-link net,” Neurocomputing,

vol. 6, no. 2, pp. 163–180, 1994. 4, 47, 48, 66, 99

[10] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: the-

ory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.

4, 20, 35, 48, 66, 99, 103, 104

[11] L. Xie, Y. Yang, Z. Zhou, J. Zheng, M. Tao, and Z. Man, “Dynamic neural

modeling of fatigue crack growth process in ductile alloys,” Information

Sciences, vol. 364, pp. 167–183, 2016. 4, 15, 28, 67, 98, 129

[12] T. Lin, B. G. Horne, P. Tino, and C. L. Giles, “Learning long-term depen-

dencies in narx recurrent neural networks,” IEEE Transactions on Neural

Networks, vol. 7, no. 6, pp. 1329–1338, 1996. 4, 15, 23, 27, 67

[13] M. T. Hagan, H. B. Demuth, M. H. Beale, and O. De Jesus, Neural network

design. PWS Pub. Boston, 1996, vol. 20. 5

[14] I. Mezic, “Spectral properties of dynamical systems, model reduction and

decompositions,” Nonlinear Dynamics, vol. 41, no. 1-3, pp. 309–325, 2005.

10


[15] J. G. Kuschewski, S. Hui, and S. H. Zak, “Application of feedforward neural

networks to dynamical system identification and control,” IEEE Transac-

tions on Control Systems Technology, vol. 1, no. 1, pp. 37–49, 1993. 10

[16] S. Billings, H. Jamaluddin, and S. Chen, “Properties of neural networks

with applications to modelling non-linear dynamical systems,” International

Journal of Control, vol. 55, no. 1, pp. 193–224, 1992. 10

[17] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. Stumpf, “Approximate

bayesian computation scheme for parameter inference and model selection

in dynamical systems,” Journal of the Royal Society Interface, vol. 6, no. 31,

pp. 187–202, 2008. 10

[18] C. Liu, W. Huang, F. Sun, M. Luo, and C. Tan, “Lds-fcm: A linear dy-

namical system based fuzzy c-means method for tactile recognition,” IEEE

Transactions on Fuzzy Systems, vol. 27, no. 1, pp. 72–83, 2019. 12

[19] W. Huang, F. Sun, L. Cao, D. Zhao, H. Liu, and M. Harandi, “Sparse coding

and dictionary learning with linear dynamical systems,” in Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition, 2016,

pp. 3938–3947. 12

[20] Z. Liu and M. Hauskrecht, “A regularized linear dynamical system frame-

work for multivariate time series analysis.” in Twenty-Ninth AAAI Confer-

ence on Artificial Intelligence, 2015, pp. 1798–1804. 12

[21] K. S. Narendra and K. Parthasarathy,“Identification and control of dynami-

cal systems using neural networks,” IEEE Transactions on Neural Networks,

vol. 1, no. 1, pp. 4–27, 1990. 5, 12, 15, 17, 66, 67, 78, 91, 98, 100

[22] K. S. Narendra and S. Mukhopadhyay, “Adaptive control using neural net-

works and approximate models,” IEEE Transactions on Neural Networks,

vol. 8, no. 3, pp. 475–485, 1997. 13


[23] H. U. Voss, J. Timmer, and J. Kurths, “Nonlinear dynamical system identi-

fication from uncertain and indirect measurements,” International Journal

of Bifurcation and Chaos, vol. 14, no. 06, pp. 1905–1933, 2004. 13

[24] L. Glass and M. C. Mackey, “Pathological conditions resulting from insta-

bilities in physiological control systems,” Annals of the New York Academy

of Sciences, vol. 316, no. 1, pp. 214–235, 1979. 13

[25] A. U. Levin and K. S. Narendra, “Control of nonlinear dynamical systems

using neural networks. ii. observability, identification, and control,” IEEE

Transactions on Neural Networks, vol. 7, no. 1, pp. 30–42, 1996. 15

[26] ——, “Control of nonlinear dynamical systems using neural networks: Con-

trollability and stabilization,” IEEE Transactions on Neural Networks,

vol. 4, no. 2, pp. 192–206, 1993. 15

[27] S. S. Haykin, Neural networks

and learning machines. Pearson Upper Saddle River, 2009, vol. 3. 17, 54,

55, 66, 79, 98

[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with

deep convolutional neural networks,” in Advances in Neural Information

Processing Systems, 2012, pp. 1097–1105. 17, 38

[29] S. Simani, “Identification and fault diagnosis of a simulated model of an

industrial gas turbine,” IEEE Transactions on Industrial Informatics, vol. 1,

no. 3, pp. 202–216, 2005. 17, 66

[30] F. Rosenblatt, “The perceptron – a perceiving and recognizing automaton,”

Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, Tech. Rep., 1957.

17


[31] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”

Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314,

1989. 17

[32] S. Haykin, Neural networks: a comprehensive foundation. Prentice Hall

PTR, 1994. 19

[33] Z. Man, K. Lee, D. Wang, Z. Cao, and S. Khoo, “Robust single-hidden layer

feedforward network-based pattern classifier,” IEEE Transactions on Neural

Networks and Learning Systems, vol. 23, no. 12, pp. 1974–1986, 2012. 19,

48, 50, 67, 79

[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal

representations by error propagation,” California Univ San Diego La Jolla

Inst for Cognitive Science, Tech. Rep., 1985. 20, 66

[35] K. Simonyan and A. Zisserman,“Very deep convolutional networks for large-

scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. 21

[36] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv

preprint arXiv:1408.5882, 2014. 21

[37] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech,

and time series,” The Handbook of Brain Theory and Neural Networks, vol.

3361, no. 10, p. 1995, 1995. 21, 38

[38] R. G. Lyons, Understanding Digital Signal Processing, 3/E. Pearson Ed-

ucation India, 2011. 22

[39] J. J. Hopfield, “Neural networks and physical systems with emergent col-

lective computational abilities,” Proceedings of the National Academy of

Sciences, vol. 79, no. 8, pp. 2554–2558, 1982. 23, 67


[40] M. I. Jordan, “Serial order: A parallel distributed processing approach,” in

Advances in Psychology. Elsevier, 1997, vol. 121, pp. 471–495. 23, 24, 67,

98

[41] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2,

pp. 179–211, 1990. 23, 25, 67, 98

[42] ——, “Learning and development in neural networks: The importance of

starting small,” Cognition, vol. 48, no. 1, pp. 71–99, 1993. 25

[43] H. T. Siegelmann, B. G. Horne, and C. L. Giles,“Computational capabilities

of recurrent narx neural networks,” IEEE Transactions on Systems, Man,

and Cybernetics, Part B (Cybernetics), vol. 27, no. 2, pp. 208–215, 1997.

27, 67

[44] J. M. P. Menezes Jr and G. A. Barreto, “Long-term time series prediction

with the narx network: an empirical evaluation,” Neurocomputing, vol. 71,

no. 16-18, pp. 3335–3343, 2008. 28

[45] M. Ardalani-Farsa and S. Zolfaghari, “Chaotic time series prediction with

residual analysis method using hybrid elman–narx neural networks,” Neu-

rocomputing, vol. 73, no. 13-15, pp. 2540–2553, 2010. 28, 98

[46] R. DiPietro, C. Rupprecht, N. Navab, and G. D. Hager, “Analyzing and ex-

ploiting narx recurrent neural networks for long-term dependencies,” arXiv

preprint arXiv:1702.07805, 2017. 28, 152

[47] L. Zhi, Y. Zhu, H. Wang, Z. Xu, and Z. Man, “A recurrent neural network

for modeling crack growth of aluminium alloy,” Neural Computing and Ap-

plications, vol. 27, no. 1, pp. 197–203, 2016. 28, 67

[48] B. Schrauwen, D. Verstraeten, and J. Van Campenhout, “An overview of

reservoir computing: theory, applications and implementations,” in Pro-


ceedings of the 15th European Symposium on Artificial Neural Networks, 2007, pp. 471–482. 30

[49] M. Lukosevicius and H. Jaeger, “Reservoir computing approaches to recur-

rent neural network training,” Computer Science Review, vol. 3, no. 3, pp.

127–149, 2009. 30, 99

[50] H. Jaeger, “The ‘echo state’ approach to analysing and training recurrent

neural networks – with an erratum note,” Bonn, Germany: German National

Research Center for Information Technology GMD Technical Report, vol.

148, no. 34, p. 13, 2001. 30, 32, 99, 106

[51] H. Jaeger and H. Haas, “Harnessing nonlinearity: Predicting chaotic sys-

tems and saving energy in wireless communication,” Science, vol. 304, no.

5667, pp. 78–80, 2004. 30, 99, 117

[52] J. B. Butcher, D. Verstraeten, B. Schrauwen, C. Day, and P. Haycock,

“Reservoir computing and extreme learning machines for non-linear time-

series data analysis,” Neural Networks, vol. 38, pp. 76–89, 2013. 30, 66, 99,

155

[53] B. Zhang, D. J. Miller, and Y. Wang, “Nonlinear system modeling with

random matrices: echo state networks revisited,” IEEE Transactions on

Neural Networks and Learning Systems, vol. 23, no. 1, pp. 175–182, 2012.

30, 98, 99

[54] M. Rigamonti, P. Baraldi, E. Zio, I. Roychoudhury, K. Goebel, and S. Poll,

“Ensemble of optimized echo state networks for remaining useful life pre-

diction,” Neurocomputing, vol. 281, pp. 121–138, 2018. 30

[55] F. M. Bianchi, E. Maiorino, M. C. Kampffmeyer, A. Rizzi, and R. Jenssen,

“An overview and comparative analysis of recurrent neural networks for

short term load forecasting,” arXiv preprint arXiv:1705.04378, 2017. 30, 99


[56] D. Li, M. Han, and J. Wang, “Chaotic time series prediction based on a

novel robust echo state network,” IEEE Transactions on Neural Networks

and Learning Systems, vol. 23, no. 5, pp. 787–799, 2012. 32, 99

[57] I. B. Yildiz, H. Jaeger, and S. J. Kiebel, “Re-visiting the echo state prop-

erty,” Neural Networks, vol. 35, pp. 1–9, 2012. 32, 106

[58] Z. K. Malik, A. Hussain, and Q. J. Wu, “Multilayered echo state machine:

a novel architecture and algorithm,” IEEE Transactions on Cybernetics,

vol. 47, no. 4, pp. 946–959, 2017. 32, 106

[59] H. Jaeger, M. Lukosevicius, D. Popovici, and U. Siewert, “Optimization and

applications of echo state networks with leaky-integrator neurons,” Neural

Networks, vol. 20, no. 3, pp. 335–352, 2007. 33, 104

[60] A. Rodan and P. Tino, “Minimum complexity echo state network,” IEEE

Transactions on Neural Networks, vol. 22, no. 1, pp. 131–144, 2011. 33, 34

[61] S. Tamura and M. Tateishi, “Capabilities of a four-layered feedforward neu-

ral network: four layers versus three,” IEEE Transactions on Neural Net-

works, vol. 8, no. 2, pp. 251–255, 1997. 35, 37, 104, 109, 110, 111, 117

[62] G.-B. Huang, “Learning capability and storage capacity of two-hidden-layer

feedforward networks,” IEEE Transactions on Neural Networks, vol. 14,

no. 2, pp. 274–281, 2003. 35, 37

[63] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of

data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

38

[64] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and

Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. 38


[65] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for

deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

38, 129

[66] K. Simonyan and A. Zisserman,“Very deep convolutional networks for large-

scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. 38

[67] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no.

7553, p. 436, 2015. 38

[68] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural

Networks, vol. 61, pp. 85–117, 2015. 38

[69] A.-r. Mohamed, G. E. Dahl, G. Hinton et al., “Acoustic modeling using

deep belief networks,” IEEE Transactions on Audio, Speech and Language

Processing, vol. 20, no. 1, pp. 14–22, 2012. 38

[70] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep be-

lief networks for scalable unsupervised learning of hierarchical representa-

tions,” in Proceedings of the 26th annual international conference on ma-

chine learning. ACM, 2009, pp. 609–616. 38

[71] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun,

“Overfeat: Integrated recognition, localization and detection using convo-

lutional networks,” arXiv preprint arXiv:1312.6229, 2013. 38

[72] Z.-Q. J. Xu, Y. Zhang, T. Luo, Y. Xiao, and Z. Ma, “Frequency princi-

ple: Fourier analysis sheds light on deep neural networks,” arXiv preprint

arXiv:1901.06523, 2019. 38

[73] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Under-

standing deep learning requires rethinking generalization,” arXiv preprint

arXiv:1611.03530, 2016. 39


[74] Y. Shang and B. W. Wah,“Global optimization for neural network training,”

Computer, vol. 29, no. 3, pp. 45–54, 1996. 39

[75] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University

Press, 2004. 41, 51

[76] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representa-

tions by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.

41, 44, 129, 137

[77] A. Van Ooyen, B. Nienhuis et al., “Improving the convergence of the back-

propagation algorithm.” Neural Networks, vol. 5, no. 3, pp. 465–471, 1992.

41

[78] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neu-

ral Networks for Perception. Elsevier, 1992, pp. 65–93. 41

[79] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv

preprint arXiv:1609.04747, 2016. 43

[80] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert

Robbins Selected Papers. Springer, 1985, pp. 102–109. 43

[81] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of

initialization and momentum in deep learning,” in International Conference

on Machine Learning, 2013, pp. 1139–1147. 44

[82] Y. Nesterov, “A method for unconstrained convex minimization problem

with the rate of convergence O(1/k²),” in Doklady AN USSR, vol. 269,

1983, pp. 543–547. 44

[83] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”

arXiv preprint arXiv:1412.6980, 2014. 44, 91, 129


[84] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The marginal

value of adaptive gradient methods in machine learning,” in Advances in

Neural Information Processing Systems, 2017, pp. 4148–4158. 46

[85] N. S. Keskar and R. Socher, “Improving generalization performance by

switching from adam to sgd,” arXiv preprint arXiv:1712.07628, 2017. 46

[86] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine

for regression and multiclass classification,” IEEE Transactions on Systems,

Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513–529,

2012. 46

[87] G.-B. Huang, D. H. Wang, and Y. Lan, “Extreme learning machines: a

survey,” International Journal of Machine Learning and Cybernetics, vol. 2,

no. 2, pp. 107–122, 2011. 46, 66, 99

[88] S. Scardapane and D. Wang, “Randomness in neural networks: an

overview,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge

Discovery, vol. 7, no. 2, p. e1200, 2017. 46, 66, 104

[89] Z. Man, K. Lee, D. Wang, Z. Cao, and S. Khoo,“An optimal weight learning

machine for handwritten digit image recognition,” Signal Processing, vol. 93,

no. 6, pp. 1624–1638, 2013. 47, 67, 104

[90] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: a

new learning scheme of feedforward neural networks,” in International Joint

Conference on Neural Networks, vol. 2. IEEE, 2004, pp. 985–990. 48, 66

[91] G. Huang, G.-B. Huang, S. Song, and K. You, “Trends in extreme learning

machines: A review,” Neural Networks, vol. 61, pp. 32–48, 2015. 48, 66, 99,

110


[92] X. Wang and M. Han, “Online sequential extreme learning machine with

kernels for nonstationary time series prediction,” Neurocomputing, vol. 145,

pp. 90–97, 2014. 48

[93] C. Wan, Z. Xu, P. Pinson, Z. Y. Dong, and K. P. Wong, “Probabilistic fore-

casting of wind power generation using extreme learning machine,” IEEE

Transactions on Power Systems, vol. 29, no. 3, pp. 1033–1044, 2014. 48

[94] G.-B. Huang, X. Ding, and H. Zhou, “Optimization method based extreme

learning machine for classification,” Neurocomputing, vol. 74, no. 1-3, pp.

155–163, 2010. 48

[95] H.-J. Rong, Y.-S. Ong, A.-H. Tan, and Z. Zhu, “A fast pruned-extreme

learning machine for classification problem,” Neurocomputing, vol. 72, no.

1-3, pp. 359–366, 2008. 48

[96] A. A. Mohammed, R. Minhas, Q. J. Wu, and M. A. Sid-Ahmed, “Human

face recognition based on multidimensional pca and extreme learning ma-

chine,” Pattern Recognition, vol. 44, no. 10-11, pp. 2588–2597, 2011. 48

[97] Z. Man, K. Lee, D. Wang, Z. Cao, and C. Miao, “A new robust training

algorithm for a class of single-hidden layer feedforward neural networks,”

Neurocomputing, vol. 74, no. 16, pp. 2491–2501, 2011. 50

[98] W. Cao, X. Wang, Z. Ming, and J. Gao, “A review on neural networks with

random weights,” Neurocomputing, vol. 275, pp. 278–287, 2018. 50

[99] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A fast

and accurate online sequential learning algorithm for feedforward networks,”

IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411–1423, 2006.

50


[100] Q.-Y. Zou, X.-J. Wang, C.-J. Zhou, and Q. Zhang, “The memory degrada-

tion based online sequential extreme learning machine,” Neurocomputing,

vol. 275, pp. 2864–2879, 2018. 50

[101] S. Xu and J. Wang, “Dynamic extreme learning machine for data stream

classification,” Neurocomputing, vol. 238, pp. 433–449, 2017. 50

[102] Q.-Y. Zhu, A. K. Qin, P. N. Suganthan, and G.-B. Huang, “Evolutionary

extreme learning machine,” Pattern Recognition, vol. 38, no. 10, pp. 1759–

1763, 2005. 50

[103] L. Zhang and D. Zhang, “Evolutionary cost-sensitive extreme learning ma-

chine,” IEEE Transactions on Neural Networks and Learning Systems,

vol. 28, no. 12, pp. 3045–3060, 2017. 50

[104] Y. Yu and Z. Sun, “Sparse coding extreme learning machine for classifica-

tion,” Neurocomputing, vol. 261, pp. 50–56, 2017. 50

[105] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,”

Neurocomputing, vol. 70, no. 16-18, pp. 3056–3062, 2007. 50

[106] P. H. Kassani, A. B. J. Teoh, and E. Kim,“Sparse pseudoinverse incremental

extreme learning machine,” Neurocomputing, vol. 287, pp. 128–142, 2018.

50

[107] W. Yu, F. Zhuang, Q. He, and Z. Shi, “Learning deep representations via

extreme learning machines,” Neurocomputing, vol. 149, pp. 308–315, 2015.

50

[108] M. D. Tissera and M. D. McDonnell, “Deep extreme learning machines:

supervised autoencoding architecture for classification,” Neurocomputing,

vol. 174, pp. 42–49, 2016. 50


[109] K. Sun, J. Zhang, C. Zhang, and J. Hu, “Generalized extreme learning

machine autoencoder and a new deep neural network,” Neurocomputing,

vol. 230, pp. 374–381, 2017. 50

[110] S. He, E. Prempain, and Q. Wu, “An improved particle swarm optimizer

for mechanical design optimization problems,” Engineering Optimization,

vol. 36, no. 5, pp. 585–605, 2004. 50

[111] D. Sun, Y. Shi, and B. Zhang, “Robust optimization of constrained me-

chanical system with joint clearance and random parameters using multi-

objective particle swarm optimization,” Structural and Multidisciplinary

Optimization, vol. 58, no. 5, pp. 2073–2084, 2018. 50

[112] Z. W. Geem, “Optimal cost design of water distribution networks using

harmony search,” Engineering Optimization, vol. 38, no. 03, pp. 259–277,

2006. 50

[113] R. Shi, L. Liu, T. Long, Y. Wu, and G. G. Wang, “Multidisciplinary model-

ing and surrogate assisted optimization for satellite constellation systems,”

Structural and Multidisciplinary Optimization, vol. 58, no. 5, pp. 2173–2188,

2018. 50

[114] S. S. Rao, Engineering optimization: theory and practice. John Wiley &

Sons, 2009. 50

[115] A. N. Tikhonov, A. Goncharsky, V. Stepanov, and A. G. Yagola, Numerical

methods for the solution of ill-posed problems. Springer Science & Business

Media, 2013, vol. 328. 50

[116] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R.

Salakhutdinov, “Improving neural networks by preventing co-adaptation of

feature detectors,” arXiv preprint arXiv:1207.0580, 2012. 51


[117] L. Perez and J. Wang, “The effectiveness of data augmentation in image

classification using deep learning,” arXiv preprint arXiv:1712.04621, 2017.

51

[118] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient

descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315,

2007. 51

[119] A. N. Tikhonov, “On the solution of ill-posed problems and the method

of regularization,” in Doklady Akademii Nauk, vol. 151, no. 3. Russian

Academy of Sciences, 1963, pp. 501–504. 51

[120] S. Wright and J. Nocedal, “Numerical optimization,” Springer Science,

vol. 35, no. 67-68, p. 7, 1999. 51

[121] A. P. Ruszczynski and A. Ruszczynski, Nonlinear optimization. Princeton

University Press, 2006, vol. 13. 52

[122] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” in Traces and

Emergence of Nonlinear Programming. Springer, 2014, pp. 247–258. 52

[123] D. G. Luenberger, Y. Ye et al., Linear and nonlinear programming.

Springer, 1984, vol. 2. 54

[124] C. Groetsch,“The theory of tikhonov regularization for fredholm equations,”

104p, Boston Pitman Publication, 1984. 55

[125] G. H. Golub, P. C. Hansen, and D. P. O’Leary,“Tikhonov regularization and

total least squares,” SIAM Journal on Matrix Analysis and Applications,

vol. 21, no. 1, pp. 185–194, 1999. 55

[126] A. N. Tikhonov and V. I. Arsenin, Solutions of ill-posed problems. Vh

Winston, 1977, vol. 14. 55


[127] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,”

Neural Computation, vol. 7, no. 1, pp. 108–116, 1995. 56

[128] E. Klann and R. Ramlau, “Regularization by fractional filter methods and

data smoothing,” Inverse Problems, vol. 24, no. 2, p. 025018, 2008. 56

[129] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geo-

metric framework for learning from labeled and unlabeled examples,” Jour-

nal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006. 56

[130] M. Fuhry and L. Reichel, “A new tikhonov regularization method,” Numer-

ical Algorithms, vol. 59, no. 3, pp. 433–445, 2012. 56

[131] M. Schmidt, “Least squares optimization with l1-norm regularization,”

CS542B Project Report, pp. 14–18, 2005. 56

[132] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of

the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.

56

[133] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face

recognition via sparse representation,” IEEE Transactions on Pattern Anal-

ysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009. 56

[134] E. J. Candes, J. K. Romberg, and T. Tao, “Stable signal recovery from

incomplete and inaccurate measurements,” Communications on Pure and

Applied Mathematics: A Journal Issued by the Courant Institute of Math-

ematical Sciences, vol. 59, no. 8, pp. 1207–1223, 2006. 57

[135] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by

reweighted ℓ1 minimization,” Journal of Fourier Analysis and Appli-

cations, vol. 14, no. 5-6, pp. 877–905, 2008. 57


[136] M. Lustig, D. Donoho, and J. M. Pauly, “Sparse mri: The application

of compressed sensing for rapid mr imaging,” Magnetic Resonance in

Medicine: An Official Journal of the International Society for Magnetic

Resonance in Medicine, vol. 58, no. 6, pp. 1182–1195, 2007. 57

[137] E. Candes and J. Romberg, “Sparsity and incoherence in compressive sam-

pling,” Inverse Problems, vol. 23, no. 3, p. 969, 2007. 57

[138] J. B. Rosen, H. Park, and J. Glick, “Signal identification using a least l1

norm algorithm,” Optimization and Engineering, vol. 1, no. 1, pp. 51–65,

2000. 57

[139] M. Schmidt, G. Fung, and R. Rosales, “Fast optimization methods for l1

regularization: A comparative study and two new approaches,” in European

Conference on Machine Learning. Springer, 2007, pp. 286–297. 58

[140] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed

optimization and statistical learning via the alternating direction method

of multipliers,”Foundations and Trends® in Machine learning, vol. 3, no. 1,

pp. 1–122, 2011. 58

[141] A. Quattoni, X. Carreras, M. Collins, and T. Darrell, “An efficient projec-

tion for ℓ1,∞ regularization,” in Proceedings of the 26th Annual Interna-

tional Conference on Machine Learning. ACM, 2009, pp. 857–864. 58

[142] W. Deng, W. Yin, and Y. Zhang, “Group sparse optimization by alternating

direction method,” in Wavelets and Sparsity XV. International Society for

Optics and Photonics, 2013. 58

[143] H. Hanachi, C. Mechefske, J. Liu, A. Banerjee, and Y. Chen, “Performance-

based gas turbine health monitoring, diagnostics, and prognostics: A sur-

vey,” IEEE Transactions on Reliability, vol. 67, no. 3, pp. 1340–1363, 2018.

59, 62


[144] M. Tahan, E. Tsoutsanis, M. Muhammad, and Z. A. Karim, “Performance-

based health monitoring, diagnostics and prognostics for condition-based

maintenance of gas turbines: A review,” Applied Energy, vol. 198, pp. 122–

144, 2017. 59, 60, 62

[145] R. G. Giffin III, J. E. Johnson, D. W. Crall, J. W. Salvage, and P. N. Szucs,

“Turbofan engine with a core driven supercharged bypass duct,” Sep. 22

1998. 59

[146] A. K. Jardine, D. Lin, and D. Banjevic, “A review on machinery diagnostics

and prognostics implementing condition-based maintenance,” Mechanical

Systems and Signal Processing, vol. 20, no. 7, pp. 1483–1510, 2006. 60

[147] L. A. Urban, “Gas path analysis applied to turbine engine condition moni-

toring,” Journal of Aircraft, vol. 10, no. 7, pp. 400–406, 1973. 60

[148] A. Razak, Industrial gas turbines: performance and operability. Elsevier,

2007. 60

[149] P. P. Walsh and P. Fletcher, Gas turbine performance. John Wiley & Sons,

2004. 60

[150] C. Kong, J. Ki, and M. Kang, “A new scaling method for component maps

of gas turbine using system identification,” Journal of Engineering for Gas

Turbines and Power(Transactions of the ASME), vol. 125, no. 4, pp. 979–

985, 2003. 60

[151] E. Tsoutsanis, N. Meskin, M. Benammar, and K. Khorasani, “Dynamic

performance simulation of an aeroderivative gas turbine using the matlab

simulink environment,” in ASME 2013 International Mechanical Engineer-

ing Congress and Exposition. American Society of Mechanical Engineers,

2013, p. V04AT04A050. 60


[152] ——, “A component map tuning method for performance prediction and

diagnostics of gas turbine compressors,” Applied Energy, vol. 135, pp. 572–

585, 2014. 60

[153] ——, “An efficient component map generation method for prediction of

gas turbine performance,” in ASME Turbo Expo 2014: Turbine Technical

Conference and Exposition. American Society of Mechanical Engineers,

2014, p. V006T06A006. 60

[154] G. A. Miste and E. Benini, “Turbojet engine performance tuning with a

new map adaptation concept,” Journal of Engineering for Gas Turbines

and Power, vol. 136, no. 7, p. 071202, 2014. 60

[155] A. Lakshminarasimha, M. Boyce, and C. Meher-Homji, “Modelling and

analysis of gas turbine performance deterioration,” in ASME 1992 Interna-

tional Gas Turbine and Aeroengine Congress and Exposition. American

Society of Mechanical Engineers, 1992, p. V004T10A022. 61

[156] A. V. Zaita, G. Buley, and G. Karlsons, “Performance deterioration mod-

eling in aircraft gas turbine engines,” in ASME 1997 International Gas

Turbine and Aeroengine Congress and Exhibition. American Society of

Mechanical Engineers, 1997, p. V004T16A007. 61

[157] A. Saxena, K. Goebel, D. Simon, and N. Eklund, “Damage propagation

modeling for aircraft engine run-to-failure simulation,” in International

Conference on Prognostics and Health Management. IEEE, 2008, pp. 1–9.

62, 84, 120

[158] D. K. Frederick, J. A. DeCastro, and J. S. Litt, “User’s guide for the com-

mercial modular aero-propulsion system simulation c-mapss,” NASA/TM-

2007-215026, E-16205, Saratoga Control Systems; Saratoga Springs, NY,

United States, Tech. Rep., 2007. 62, 83, 84, 120


[159] Y. Li and P. Nilkitsaranont, “Gas turbine performance prognostic for

condition-based maintenance,” Applied energy, vol. 86, no. 10, pp. 2152–

2161, 2009. 62

[160] J. Yin, W. Wang, Z. Man, and S. Khoo, “Statistical modeling of gear vi-

bration signals and its application to detecting and diagnosing gear faults,”

Information Sciences, vol. 259, pp. 295–303, 2014. 63, 130, 131

[161] W. Wang and A. K. Wong, “Autoregressive model-based gear fault diagno-

sis,” Journal of Vibration and Acoustics, vol. 124, no. 2, pp. 172–179, 2002.

63, 66, 128, 129, 142

[162] L. Marinai, “Gas-path diagnostics and prognostics for aero-engines using

fuzzy logic and time series analysis,” Ph.D. dissertation, Cranfield Univer-

sity, 2004. 63

[163] N. Puggina and M. Venturini, “Development of a statistical methodology

for gas turbine prognostics,” Journal of Engineering for Gas Turbines and

Power, vol. 134, no. 2, p. 022401, 2012. 63

[164] S. S. Tayarani-Bathaie, Z. S. Vanini, and K. Khorasani, “Dynamic neural

network-based fault diagnosis of gas turbine engines,” Neurocomputing, vol.

125, pp. 153–165, 2014. 63

[165] Z. S. Vanini, K. Khorasani, and N. Meskin, “Fault detection and isolation of

a dual spool gas turbine engine using dynamic neural networks and multiple

model approach,” Information Sciences, vol. 259, pp. 234–251, 2014. 63

[166] M. Amozegar and K. Khorasani, “An ensemble of dynamic neural network

identifiers for fault detection and isolation of gas turbine engines,” Neural

Networks, vol. 76, pp. 106–121, 2016. 63


[167] T. De Bruin, K. Verbert, and R. Babuska, “Railway track circuit fault

diagnosis using recurrent neural networks,” IEEE Transactions on Neural

Networks and Learning Systems, vol. 28, no. 3, pp. 523–533, 2017. 66, 155

[168] Z. Man, W. Wang, S. Khoo, and J. Yin, “Optimal sinusoidal modelling of

gear mesh vibration signals for gear diagnosis and prognosis,” Mechanical

Systems and Signal Processing, vol. 33, pp. 256–274, 2012. 66, 128, 129,

131, 133, 135, 136, 142

[169] X. Dai and Z. Gao, “From model, signal to knowledge: A data-driven per-

spective of fault detection and diagnosis,” IEEE Transactions on Industrial

Informatics, vol. 9, no. 4, pp. 2226–2238, 2013. 66

[170] W. Wang, “Early detection of gear tooth cracking using the resonance de-

modulation technique,” Mechanical Systems and Signal Processing, vol. 15,

no. 5, pp. 887–903, 2001. 66, 128, 135, 145

[171] R. Chandra, Y.-S. Ong, and C.-K. Goh, “Co-evolutionary multi-task learn-

ing with predictive recurrence for multi-step chaotic time series prediction,”

Neurocomputing, vol. 243, pp. 21–34, 2017. 66, 98

[172] D. Wang, D. Liu, H. Li, H. Ma, and C. Li, “A neural-network-based online

optimal control approach for nonlinear robust decentralized stabilization,”

Soft Computing, vol. 20, no. 2, pp. 707–716, 2016. 66

[173] R. Liu, G. Meng, B. Yang, C. Sun, and X. Chen, “Dislocated time series

convolutional neural architecture: An intelligent fault diagnosis approach

for electric machine,” IEEE Transactions on Industrial Informatics, vol. 13,

no. 3, pp. 1310–1320, 2017. 66

[174] W. F. Schmidt, M. A. Kraaijveld, and R. P. Duin, “Feedforward neural

networks with random weights,” in 11th IAPR International Conference on


Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodol-

ogy and Systems. IEEE, 1992, pp. 1–4. 66

[175] M. Rafiei, T. Niknam, and M.-H. Khooban, “Probabilistic forecasting of

hourly electricity price by generalization of elm for usage in improved

wavelet neural network,” IEEE Transactions on Industrial Informatics,

vol. 13, no. 1, pp. 71–79, 2017. 66

[176] B. Igelnik and Y.-H. Pao, “Stochastic choice of basis functions in adaptive

function approximation and the functional-link net,” IEEE Transactions on

Neural Networks, vol. 6, no. 6, pp. 1320–1329, 1995. 67, 104

[177] M. Li and D. Wang, “Insights into randomized algorithms for neural net-

works: Practical issues and common pitfalls,” Information Sciences, vol.

382, pp. 170–178, 2017. 67

[178] D. Wang and M. Li, “Stochastic configuration networks: Fundamentals and

algorithms,” IEEE Transactions on Cybernetics, vol. 47, no. 10, pp. 3466–

3479, 2017. 67, 99

[179] T. W. Chow and Y. Fang, “A recurrent neural-network-based real-time

learning control strategy applying to nonlinear systems with unknown dy-

namics,” IEEE Transactions on Industrial Electronics, vol. 45, no. 1, pp.

151–161, 1998. 67, 98

[180] Y. Bengio, P. Simard, P. Frasconi et al., “Learning long-term dependencies

with gradient descent is difficult,” IEEE Transactions on Neural Networks,

vol. 5, no. 2, pp. 157–166, 1994. 67

[181] M. Tao, Z. Man, J. Zheng, A. Cricenti, and W. Wang, “A new dynamic

neural modelling for mechatronic system prognostics,” in 2016 International

Conference on Advanced Mechatronic Systems (ICAMechS). IEEE, 2016,

pp. 437–442. 67, 78, 85, 120


[182] D. Nguyen and B. Widrow, “Improving the learning speed of 2-layer neural

networks by choosing initial values of the adaptive weights,” in International

Joint Conference on Neural Networks. IEEE, 1990, pp. 21–26. 75

[183] T. Wang, J. Yu, D. Siegel, and J. Lee, “A similarity-based prognostics

approach for remaining useful life estimation of engineered systems,” in

International Conference on Prognostics and Health Management. IEEE,

2008, pp. 1–6. 84

[184] J. Sun, H. Zuo, W. Wang, and M. G. Pecht, “Application of a state space

modeling technique to system prognostics based on a health index for

condition-based maintenance,” Mechanical Systems and Signal Processing,

vol. 28, pp. 585–596, 2012. 84

[185] M. Han, S. Zhang, M. Xu, T. Qiu, and N. Wang, “Multivariate chaotic time

series online prediction based on improved kernel recursive least squares

algorithm,” IEEE Transactions on Cybernetics, no. 99, pp. 1–13, 2018. 98

[186] R. Chandra, Y.-S. Ong, and C.-K. Goh, “Co-evolutionary multi-task learn-

ing for dynamic time series prediction,” Applied Soft Computing, vol. 70,

pp. 576–589, 2018. 98

[187] S. Billings, H. Jamaluddin, and S. Chen, “Properties of neural networks

with applications to modelling non-linear dynamical systems,” International

Journal of Control, vol. 55, no. 1, pp. 193–224, 1992. 98

[188] R. Deo and R. Chandra, “Identification of minimal timespan problem for

recurrent neural networks with application to cyclone wind-intensity pre-

diction,” in International Joint Conference on Neural Networks. IEEE,

2016, pp. 489–496. 98


[189] G.-B. Huang, Q.-Y. Zhu, and C. K. Siew, “Real-time learning capability of

neural networks,” IEEE Transactions on Neural Networks, vol. 17, no. 4,

pp. 863–878, 2006. 98

[190] X. Liu, S. Lin, J. Fang, and Z. Xu, “Is extreme learning machine feasible?

a theoretical assessment (part i),” IEEE Transactions on Neural Networks

and Learning Systems, vol. 26, no. 1, pp. 7–20, 2015. 99

[191] W. Wang, W. Hu, and N. Armstrong, “Fatigue crack prognosis using

bayesian probabilistic modelling,” Mechanical Engineering Journal, vol. 4,

no. 5, p. 16-00702, 2017. 128

[192] S. Braun, “The synchronous (time domain) average revisited,” Mechanical

Systems and Signal Processing, vol. 25, no. 4, pp. 1087–1102, 2011. 128

[193] Y. Ding, W. He, B. Chen, Y. Zi, and I. W. Selesnick, “Detection of faults

in rotating machinery using periodic time-frequency sparsity,” Journal of

Sound and Vibration, vol. 382, pp. 357–378, 2016. 128

[194] L. Wang, Y. Shao, L. Yin, Y. Yuan, and J. Liu, “Fault diagnosis for centre

wear fault of roll grinder based on a resonance demodulation scheme,” in

Journal of Physics: Conference Series, vol. 842, no. 1. IOP Publishing,

2017, p. 012057. 128

[195] J. Zhang, J. S. Dhupia, and C. J. Gajanayake, “Stator current analysis from

electrical machines using resonance residual technique to detect faults in

planetary gearboxes,” IEEE Transactions on Industrial Electronics, vol. 62,

no. 9, pp. 5709–5721, 2015. 128

[196] H. Zheng, Z. Li, and X. Chen, “Gear fault diagnosis based on continuous

wavelet transform,” Mechanical Systems and Signal Processing, vol. 16, no.

2-3, pp. 447–457, 2002. 128


[197] T. Barszcz and R. B. Randall, “Application of spectral kurtosis for detec-

tion of a tooth crack in the planetary gear of a wind turbine,” Mechanical

Systems and Signal Processing, vol. 23, no. 4, pp. 1352 – 1365, 2009. 128

[198] F. Combet and L. Gelman, “Optimal filtering of gear signals for early dam-

age detection based on the spectral kurtosis,” Mechanical Systems and Sig-

nal Processing, vol. 23, no. 3, pp. 652–668, 2009. 128

[199] F. Jia, Y. Lei, J. Lin, X. Zhou, and N. Lu,“Deep neural networks: A promis-

ing tool for fault characteristic mining and intelligent diagnosis of rotating

machinery with massive data,” Mechanical Systems and Signal Processing,

vol. 72, pp. 303–315, 2016. 129

[200] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, “Parallelized stochastic

gradient descent,” in Advances in Neural Information Processing Systems,

2010, pp. 2595–2603. 129

[201] C. Zhang, P. Lim, A. Qin, and K. C. Tan, “Multiobjective deep belief

networks ensemble for remaining useful life estimation in prognostics,” IEEE

Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp.

2306–2318, 2017. 129

[202] D. Verstraete, A. Ferrada, E. L. Droguett, V. Meruane, and M. Modarres,

“Deep learning enabled fault diagnosis using time-frequency image analysis

of rolling element bearings,” Shock and Vibration, vol. 2017, 2017. 129

[203] R. Fletcher and C. M. Reeves, “Function minimization by conjugate gradi-

ents,” The Computer Journal, vol. 7, no. 2, pp. 149–154, 1964. 129

[204] B. D. Forrester, “Advanced vibration analysis techniques for fault detection

and diagnosis in geared transmission systems,” Ph.D. dissertation, Swin-

burne University of Technology Melbourne, VIC, Australia, 1996. 129, 142


[205] M. Tao, W. Wang, Z. Man, Z. Cao, H. Le Vu, J. Zheng, and A. Cricenti,

“Structured learning-based sinusoidal modelling for gear diagnosis and prog-

nosis,” in Fracture, Fatigue and Wear. Springer, 2018, pp. 184–193. 142

[206] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Com-

putation, vol. 9, no. 8, pp. 1735–1780, 1997. 152

[207] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neu-

ral networks,” in International Conference on Machine Learning, 2016, pp.

1120–1128. 152

[208] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement

learning for financial signal representation and trading,” IEEE Transactions

on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 653–664, 2017.

154

[209] J. A. Starzyk and H. He, “Spatio–temporal memories for machine learn-

ing: A long-term memory organization,” IEEE Transactions on Neural Net-

works, vol. 20, no. 5, pp. 768–780, 2009. 155

[210] G. Shi, D. Liu, and Q. Wei, “Energy consumption prediction of office build-

ings based on echo state networks,” Neurocomputing, vol. 216, pp. 478–488,

2016. 155


List of Publications

1. M. Tao, Z. Man, J. Zheng, A. L. Cricenti, and W. Wang, “A new dy-

namic neural modelling for mechatronic system prognostics”, in 2016 In-

ternational Conference on Advanced Mechatronic Systems (ICAMechS),

pp. 437-442. IEEE, 2016.

2. M. Tao, W. Wang, Z. Man, Z. Cao, H. L. Vu, J. Zheng, and A. L. Cricenti,

“Structured Learning-Based Sinusoidal Modelling for Gear Diagnosis and

Prognosis”, in Fracture, Fatigue and Wear, pp. 184-193. Springer, Singa-

pore, 2018.

3. M. Tao, W. Wang, Z. Man, Z. Cao, H. L. Vu, J. Zheng, and A. L. Cri-

centi, “Learning from Back-forward Stochastic Convolution for Dynamical

System Modelling with Sequence Output”, submitted to Neural Networks.

4. M. Tao, W. Wang, Z. Man, Z. Cao, H. L. Vu, J. Zheng, and A. L. Cricenti,

“Neural Network Learning based Sinusoidal Modelling for Load Varying

Gear Tooth Cracking Process”, submitted to Mechanical System and Signal

Processing.

5. L. Xie, Y. Yang, Z. Zhou, J. Zheng, M. Tao, and Z. Man, “Dynamic neural

modeling of fatigue crack growth process in ductile alloys”, Information

Sciences, vol. 364-365, pp. 167-183, 2016.

6. L. Zhi, Z. Man, Z. Cao, M. Tao, and P. Wang, “Modeling crack growth


of aluminum alloy under variable-amplitude loading using dynamic neu-

ral network”, in 2016 International Conference on Advanced Mechatronic

Systems (ICAMechS), pp. 453-458. IEEE, 2016.

7. L. Xing, Z. Man, J. Zheng, A. L. Cricenti, and M. Tao, “A Robust and

Accurate Neural Predictive Model for Foreign Exchange Market Modelling

and Forecasting”, in 2018 Australian and New Zealand Control Conference

(ANZCC), pp. 419-424. IEEE, 2018.
