Schizophrenia-mimicking layers outperform conventional neural network layers

Ryuta Mizutani 1*, Senta Noguchi 1, Rino Saiga 1, Mitsuhiro Miyashita 2, Makoto Arai 2, and Masanari Itokawa 2,3

1 Department of Applied Biochemistry, Tokai University, Hiratsuka, Kanagawa 259-1292, Japan
2 Tokyo Metropolitan Institute of Medical Science, Setagaya, Tokyo 156-8506, Japan
3 Tokyo Metropolitan Matsuzawa Hospital, Setagaya, Tokyo 156-0057, Japan

*[email protected]

ABSTRACT

We have reported nanometer-scale three-dimensional studies of the brain networks of schizophrenia cases and found that their neurites are thin and tortuous compared to those of healthy controls. This suggests that connections between distal neurons are impaired in the microcircuits of schizophrenia cases. In this study, we applied these biological findings to the design of a schizophrenia-mimicking artificial neural network to simulate the connection impairment of the disorder. Neural networks having a schizophrenia connection layer in place of a fully connected layer were subjected to image classification tasks using the MNIST and CIFAR-10 datasets. The obtained results revealed that the schizophrenia connection layer is tolerant to overfitting and outperforms a fully connected layer. A schizophrenia-mimicking convolution layer was also tested with the VGG configuration, showing that 60% of the kernel weights of the last convolution layer can be eliminated while keeping competitive performance. Schizophrenia-mimicking layers can be used in place of fully connected or convolution layers without any change in the network configuration or training procedures; hence, the outperformance of the schizophrenia-mimicking layer is easily incorporated into neural networks. The results of this study indicate that the connection impairment in schizophrenia is not a burden to the brain but has functional roles in attaining better brain performance. We suggest that the seemingly neuropathological alterations observed in schizophrenia have been rationally implemented in our brain during the process of biological evolution.

INTRODUCTION

The artificial neural network was originally designed by modeling the information processing of the brain (Rosenblatt, 1958), which is subdivided into functionally different areas, such as the visual cortex of the occipital lobe or the auditory cortex of the temporal lobe (Brodmann, 1909; Amunts & Zilles, 2015). Studies on the visual cortex (Hubel & Wiesel, 1959) inspired the convolutional neural network (Fukushima, 1980), which has since evolved into a wide variety of network configurations (Simonyan & Zisserman, 2014; He et al., 2016). Therefore, structural analysis of real brain networks and the incorporation of the resultant biological knowledge into artificial intelligence algorithms have the potential to further improve the performance of machine learning.

Analysis of not only healthy control cases but also psychiatric disorder cases can provide a clue to inventing an unprecedented design of artificial neural network. It has been reported that polygenic risk scores for schizophrenia and bipolar disorder are associated with artistic society membership or a creative profession (Power et al., 2015). A higher incidence of psychiatric disorders has been found for geniuses and their families than for the average population (Juda, 1949). Distinguishing features of the neuronal networks of psychiatric cases can therefore be applied to designing unconventional architectures of artificial intelligence.

We have recently reported nanometer-scale three-dimensional studies of the neuronal networks of the anterior cingulate cortex and the superior temporal gyrus of schizophrenia cases and age/gender-matched controls, conducted using synchrotron radiation nanotomography, or nano-CT (Mizutani et al., 2019; Mizutani et al., 2020). The results indicated that neurites of the schizophrenia cases are thin and tortuous, while those of the control cases are thick and straight. The nano-CT analysis also revealed that the neurite diameter is proportional to the diameter of dendritic spines, which form synaptic connections between neurons. It has been reported that thinning of neurites or spines attenuates the firing efficiency of neurons (Spruston, 2008) and hence affects the activity of the areas to which they belong.

In this study, we implemented these biological findings in artificial neural networks to delineate 1) how far the brain can withstand the structural alterations of neurons observed in schizophrenia and 2) how those findings can be incorporated into artificial neural networks to improve their performance. The analyses were performed using newly designed layers that mimic the connection impairment in schizophrenia. The obtained results indicated that schizophrenia-mimicking layers tolerate parameter reduction of up to 80% of weights and outperform conventional neural network layers.

METHODS

The etiology of schizophrenia has been discussed from neurodegenerative and neurodevelopmental standpoints (Allin & Murray, 2002; Gupta & Kulhara, 2010). The neurodegenerative hypothesis claims that schizophrenia is a disorder due to degeneration in the brain. The neurodevelopmental hypothesis instead proposes that the brain network is formed abnormally during the developmental process. The etiology of schizophrenia has also been discussed on the basis of the two-hit hypothesis (Maynard et al., 2001), in which a "first hit" during early development primes the pathogenic response and a "second hit" later in life causes the disorder. Diverse symptomatic aspects of schizophrenia have been classified into two types by taking the two-hit hypothesis into account (Crow, 1985). Type I is characterized by positive symptoms such as hallucinations and delusions and is accompanied by no intellectual impairment, while Type II shows affective flattening and has been considered a "defect state" (Crow, 1980).

In this study, we translated these disorder classifications into two working models of artificial neural networks. The first is the disorganized model, which mimics neurodegeneration after the formation of the cerebral neuronal network. This can be simulated with artificial neural networks that are trained normally first and then disorganized, so that the a posteriori disorganization simulates neurodegeneration after the formation of the neuronal network. The other working model is a developmental model, in which we assume concurrent progress of neuropathological changes and brain development. This developmental model can be simulated by implementing a connection-impaired layer in the neural network, which is then trained under the impairment. The resultant performance of these models should reveal their relevance to the disorder types.

The thinning of neurites in schizophrenia (Mizutani et al., 2019, 2020) should hinder the transmission of input potentials depending on the neurite length from the soma (Spruston, 2008). Therefore, distal synaptic connections should deteriorate more severely than proximal connections. This can be reproduced in an artificial neural network by defining a distance measure between nodes and damping connection parameters depending on that measure. Here we assume a one-dimensional arrangement of nodes and define the distance d_ij of a connection between the i-th input node x_i and the j-th output node y_j as

d_ij = |r · i − j| / √(r² + 1),

where r = n_y / n_x is the ratio of the numbers of nodes between the target and the preceding layers. This distance measure is equal to the Euclidean distance from the diagonal of the weight matrix. A window matrix built from this distance measure was used to modify the weight matrix. Figure 1 shows examples of window matrices having identical numbers of inputs and outputs. Diagonal connection impairment (Figure 1b) is performed by zeroing weight parameters whose distances from the diagonal are larger than a threshold. This is implemented by masking the weight matrix with a window matrix F = (f_ij), where elements f_ij distal to the diagonal are set to 0 and elements proximal to the diagonal are set to 1. The Gaussian window (Figure 1c) uses matrix elements f_ij of a Gaussian form:

f_ij = exp(−d_ij² / 2σ²),

where σ represents the window width. Other window variations designed apart from this distance idea are also shown in Figure 1 (Figure 1d–f). The random window does not depend on the matrix geometry and hence can also be applied to convolution layers. The parameter reduction ratio was defined using the ratio between the sum of window elements Σ f_ij and the total number of weights. The weight matrix was multiplied by the window matrix in an element-by-element manner and then normalized with the parameter reduction ratio so as to keep the sum of weights unchanged.

Figure 1. Schematic drawings of weight windows. These drawings assume a 5 × 5 square weight matrix. Each tiny box indicates an element of the matrix. Window values are represented with a gray scale from 0 (white) to 1 (black). (a) Fully connected network. (b) A diagonal window representing the connection impairment in schizophrenia. Weight elements indicated with black boxes were used for training and/or evaluation. Elements shown as open boxes were set to zero and were not used in the training and/or evaluation. (c) A Gaussian window mimics the distance-dependent gradual decrease in real neuronal connections. (d) Stripe window. (e) Centered window. (f) Random window.
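As an illustration of the window construction described above, the following NumPy sketch builds a diagonal window and applies it to a weight matrix. This is a minimal sketch, not our published code: the function names and the quantile-based choice of the distance threshold are assumptions made for clarity.

```python
import numpy as np

def diagonal_window(n_in, n_out, reduction):
    """Build a 0/1 window that keeps weight elements near the matrix
    diagonal and zeroes the rest (cf. Figure 1b). `reduction` is the
    fraction of elements to eliminate; picking the threshold as a
    quantile of the distances is an assumption of this sketch."""
    r = n_out / n_in  # ratio of node counts, target vs. preceding layer
    i = np.arange(n_in)[:, None]
    j = np.arange(n_out)[None, :]
    d = np.abs(r * i - j) / np.sqrt(r ** 2 + 1)  # distance from the diagonal
    threshold = np.quantile(d, 1.0 - reduction)
    return (d <= threshold).astype(np.float32)

def mask_weights(weights, window):
    """Element-by-element masking, rescaled to keep the weight sum."""
    return weights * window / window.mean()
```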


Influences of these connection impairments on the neural network were examined using the MNIST handwritten digits dataset (LeCun et al., 1998) and the CIFAR-10 picture dataset (Krizhevsky, 2009). Hereafter we call a fully connected layer masked with the schizophrenia window a "schizophrenia connection layer" and a convolution layer masked with the schizophrenia window a "schizophrenia convolution layer". Network configurations used for the image classification tasks are summarized in Table 1 and fully described in Supplementary Table 1. Simple 3- and 4-layer configurations (Table 1, networks A and B) were used for MNIST classification tasks. Network A, with one schizophrenia connection layer as a hidden layer, was used for the analysis of the developmental model, in which the connection impairment was incorporated in both training and evaluation. Network B was used for the analysis of the disorganized model, in which the connection impairment was incorporated only in the evaluation step. In network B, a pair of a fully connected layer and a schizophrenia connection layer with identical numbers of nodes was used as hidden layers to prepare square weight matrices of different sizes, which were used to analyze the effect of the weight matrix dimension on the connection impairment. Convolutional networks C–E (Table 1) were used for classification tasks of the CIFAR-10 dataset. The configuration of networks C and D was taken from the Keras example code and used for testing the schizophrenia connection layer as a top layer. Network C was used for the analysis of the developmental model and network D for the disorganized model. Network E was used for testing the schizophrenia convolution layer in the VGG configuration (Simonyan & Zisserman, 2014). Convolutional blocks of network E were taken from the VGG16 network, while batch normalization (Ioffe & Szegedy, 2015) and dropout (Srivastava et al., 2014) layers were incorporated (Supplementary Table 1e). Kernel weights in the schizophrenia convolution layer were randomly masked with the window. Biases were enabled in all layers except for the schizophrenia-mimicking layers of the disorganized model; bias values can be refined in the developmental model but cannot be modified according to the inter-node distance in the evaluation step. The ReLU activation function (Glorot et al., 2011) was used for all hidden layers and the softmax function for output layers. Hidden layers were initialized with He's method (He et al., 2015). Networks A, B, and E were trained using the Adam algorithm (Kingma & Ba, 2014). Networks A and B were trained with a learning rate of 1 × 10⁻³. Network E was trained with a learning rate of 5 × 10⁻⁴ first and then with 1 × 10⁻⁴ after 150 epochs. Networks C and D were trained using the RMSprop algorithm (Tieleman & Hinton, 2012) with a learning rate of 1 × 10⁻⁴ and a decay of 1 × 10⁻⁶. Batch sizes were set to 32 for networks A–D and 200 for network E. Data augmentation (Wong et al., 2016) was introduced in the training of network E.

Training and evaluation of the networks were conducted using TensorFlow 2.3.0 and Keras 2.4.0 running on c5a.xlarge (4 vCPUs of AMD EPYC processors operated at 2.8 GHz) or c5a.2xlarge (8 vCPUs) instances of Amazon Web Services. The CPU time required for training and evaluating networks using schizophrenia-mimicking layers was slightly shorter than that for networks using normal layers, though the incorporation of the Gaussian window required additional time to initialize its window elements. Python codes used in this study are available from our GitHub repository (https://mizutanilab.github.io). Statistical analyses were conducted using R 3.4.3. Significance was defined as p < 0.05.
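As a concrete illustration of this drop-in replacement, the following TensorFlow/Keras sketch defines a dense layer whose kernel is element-wise masked with the diagonal window from the sketch above. `SzDense` is a hypothetical name written against the public Keras API; the code in our repository may differ in detail.

```python
import tensorflow as tf

class SzDense(tf.keras.layers.Dense):
    """Dense layer masked with a fixed, non-trainable diagonal window
    (a sketch of the schizophrenia connection layer)."""
    def __init__(self, units, reduction=0.5, **kwargs):
        super().__init__(units, **kwargs)
        self.reduction = reduction

    def build(self, input_shape):
        super().build(input_shape)
        # uses diagonal_window from the NumPy sketch above
        w = diagonal_window(int(input_shape[-1]), self.units, self.reduction)
        # rescale so that the sum of effective weights stays unchanged
        self.window = tf.constant(w / w.mean(), dtype=self.dtype)

    def call(self, inputs):
        outputs = tf.matmul(inputs, self.kernel * self.window)
        if self.use_bias:
            outputs = tf.nn.bias_add(outputs, self.bias)
        if self.activation is not None:
            outputs = self.activation(outputs)
        return outputs

# e.g. network A: Input (28 x 28) -> Sz (512) -> Output (10)
network_a = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    SzDense(512, reduction=0.5, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```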


Table 1. Network configurations. Numbers in parentheses represent the number of nodes or filters. Sz, schizophrenia connection layer; FC, fully connected layer; FC-Sz, trained as a fully connected layer and evaluated using the schizophrenia window; Conv, 2-dimensional convolution layer with a kernel size of 3 × 3; VGG16Conv3, the first 3 convolutional blocks of the VGG16 network; SzConv, 2-dimensional schizophrenia convolution layer with a kernel size of 3 × 3. *Dimensions of these hidden layers were set equal to each other and varied to examine the effect of the weight matrix dimension on the connection impairment.

A: Input (28 × 28) → Sz (512) → Output (10)
B: Input (28 × 28) → FC (64–1024)* → FC-Sz (64–1024)* → Output (10)
C: Input (32 × 32 RGB) → Conv (32) → Conv (32) → Maxpool → Conv (64) → Conv (64) → Maxpool → Sz (512) → Output (10)
D: Input (32 × 32 RGB) → Conv (32) → Conv (32) → Maxpool → Conv (64) → Conv (64) → Maxpool → FC-Sz (512) → Output (10)
E: Input (32 × 32 RGB) → VGG16Conv3 → SzConv (512) → SzConv (512) → SzConv (512) → Maxpool → FC (4096) → FC (4096) → FC (1024) → Output (10)

RESULTS

Figure 2 summarizes the relationships between the connection impairment and the classification error in the disorganized model, in which training precedes the impairment. Figure 2a shows the dependence on window shape in the MNIST classification tasks, which were conducted using a 4-layer network (Table 1, B). Weight parameters between two hidden layers with identical numbers of nodes were modified after training to mimic neurodegeneration after the formation of a neuronal network. The resultant modified network was subjected to evaluation using the validation dataset. The obtained results indicated that the network of this configuration can withstand parameter reduction of up to approximately 60% of the connections between the hidden layers (Figure 2a). The profiles showed little dependence on window shape except for the centered window, indicating that the contribution of each weight element to the overall network performance is equivalent regardless of its position in the weight matrix. Hence, we mainly used the diagonal window (Figure 1b) in the following analyses.

Figure 2b shows the dependence on the number of epochs in the disorganized model of the MNIST tasks. The results indicated that the network became slightly more sensitive as the training became longer, suggesting that the redundancy of the weight matrix elements decreased over the long duration of training. Figure 2c shows the relation between the number of nodes and the persistence against the connection impairment. Networks having 64 or 128 nodes in the hidden layers were prone to error increases due to the parameter reduction, while networks having 256 nodes or more in the hidden layers withstood parameter reduction of up to 60–80% of weights. These results indicate that the number of parameters of networks having 256 nodes or more is sufficient to store the trained information. In contrast, the CIFAR-10 classification task using network D (Table 1) showed an error increase nearly proportional to the parameter reduction (Figure 2d). This indicates that the information acquired by training is distributed uniformly but not redundantly in the top layer, resulting in a low persistence of the network against the parameter reduction.
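The disorganized model described above amounts to masking the trained weights only at evaluation time. A minimal sketch, reusing the hypothetical helpers defined in the Methods section, could read:

```python
def disorganize_and_evaluate(model, layer_index, reduction, x_test, y_test):
    """Mask a trained layer's kernel with a diagonal window and evaluate
    (disorganized model). Assumes the target layer was built with
    use_bias=False, as for the FC-Sz layers of this model; uses the
    diagonal_window and mask_weights sketches defined above."""
    layer = model.layers[layer_index]
    [kernel] = layer.get_weights()
    window = diagonal_window(kernel.shape[0], kernel.shape[1], reduction)
    layer.set_weights([mask_weights(kernel, window)])
    return model.evaluate(x_test, y_test, verbose=0)
```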


Figure 2. Relationship between the classification error and the parameter reduction in the disorganized model. Networks were trained first, and then the weights of the target layers were masked with the window in the evaluation. The left intercept corresponds to a network with fully connected layers. Panels a–c show results of MNIST classification. Unless otherwise stated, network B having 512 nodes in two hidden layers was trained for 20 epochs and then its schizophrenia connection layer was masked using a diagonal window in the evaluation. The training and evaluation were repeated for 100 sessions and the mean errors were plotted. (a) Effect of window shape. (b) Dependence on training duration. (c) Dependence on hidden layer dimensions. (d) Result of CIFAR-10 classification. Network D was trained for 100 epochs and its top layer was masked using a diagonal window. The training and evaluation were repeated for 30 sessions and the mean errors were plotted.


The developmental model showed distinct features that were not observed for the disorganized model. Figure 3a shows the training progress of CIFAR-10 classification tasks in the developmental model, in which the weights of the top layer were masked with a diagonal window throughout the training and the evaluation. The task was performed using network C, which consists of two blocks of convolution layers and one schizophrenia connection top layer. A network with the same configuration but a fully connected top layer was used as a control. The obtained results revealed the outperformance of the schizophrenia network. The control network showed overfitting after approximately 75 epochs of training, whereas the schizophrenia network showed a continuous error decline down to 200 epochs. The classification error of the schizophrenia network was significantly lower than that of the control even before the overfitting (p = 0.014 at 75 epochs and p = 1.1 × 10⁻⁵ at 200 epochs, two-sided Wilcoxon test, n1 = n2 = 10).

The connection impairment of schizophrenia was also introduced into convolution layers by masking kernel weights with the random window. The schizophrenia convolution layer was implemented in the VGG configuration to perform CIFAR-10 classification tasks. Figure 3b shows the training progress. A schizophrenia network with a 60% reduction of the kernel weights of the last convolution layer showed slightly better performance than the control network. Another schizophrenia network with 60% parameter reduction in the last 3 convolution layers showed performance comparable to the network without the parameter reduction. These results indicate that the convolution layers of this network contain parameter redundancy that can be eliminated by using the schizophrenia convolution layer.
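Masking kernel weights with a random window can be sketched along the same lines as the dense case. The following is a minimal illustration, assuming stride 1 and zero ("same") padding as in network E; `SzConv2D` is a hypothetical name, not the published layer.

```python
import numpy as np
import tensorflow as tf

class SzConv2D(tf.keras.layers.Conv2D):
    """Conv2D whose kernel is masked with a fixed random 0/1 window
    (a sketch of the schizophrenia convolution layer)."""
    def __init__(self, filters, kernel_size, reduction=0.6, **kwargs):
        super().__init__(filters, kernel_size, **kwargs)
        self.reduction = reduction

    def build(self, input_shape):
        super().build(input_shape)
        keep = np.random.default_rng().random(tuple(self.kernel.shape))
        w = (keep >= self.reduction).astype(np.float32)  # random window
        self.window = tf.constant(w / w.mean(), dtype=self.dtype)

    def call(self, inputs):
        # convolve with the masked kernel (stride 1, zero padding assumed)
        outputs = tf.nn.conv2d(inputs, self.kernel * self.window,
                               strides=1, padding='SAME')
        if self.use_bias:
            outputs = tf.nn.bias_add(outputs, self.bias)
        if self.activation is not None:
            outputs = self.activation(outputs)
        return outputs
```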


Figure 3. Training progress of CIFAR-10 classification tasks. (a) Network C with a schizophrenia connection layer was trained for 10 sessions and resultant errors are plotted in red. Parameter reduction in the schizophrenia connection layer was set to 50%. Errors of control network with the same configuration but having a fully connected layer are plotted in black. (b) VGG networks with schizophrenia convolution layers (network E) were trained and resultant errors are plotted. A network with 60% parameter reduction in the last convolution layer is drawn in red, and that with 60% reduction in the last 3 convolution layers in orange. A control network without the parameter reduction is drawn in black.


The relation between the performance and the parameter reduction ratio in the developmental model was also examined. Figure 4a shows results of MNIST classification tasks using a 3-layer network (Table 1, A). The classification error of the schizophrenia network gradually decreased and outperformed the control network as the parameter reduction was increased up to 70%. Profiles obtained using the diagonal and Gaussian windows were almost the same, though the Gaussian window showed better persistence against the parameter reduction than the diagonal window. This better persistence suggests that the bilateral tails of the Gaussian function allowed weak connections between distal nodes and mitigated the weight masking. Figure 4b shows results of CIFAR-10 classification tasks using network C. The relationship between the error and the parameter reduction was similar to that observed for the MNIST tasks. The profiles shifted toward the lower right with the longer duration of training, indicating that a schizophrenia connection layer with a higher parameter reduction exhibits better performance when trained longer.
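The reduction sweep of Figure 4 can be expressed as a simple loop. In this sketch, `build_network_a` is an assumed helper returning network A with an SzDense(512) hidden layer at the given reduction ratio, and the MNIST arrays are assumed to be loaded:

```python
# Sweep the parameter reduction ratio of the developmental model
# (cf. Figure 4a); the helper and data arrays are assumptions.
errors = {}
for reduction in (0.0, 0.3, 0.5, 0.7, 0.9):
    model = build_network_a(reduction)
    model.fit(x_train, y_train, epochs=50, batch_size=32, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, verbose=0)
    errors[reduction] = 1.0 - accuracy  # classification error
```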


Figure 4. Relationship between the error and the parameter reduction in the developmental model, in which the weights of schizophrenia connection layers were masked with the window throughout the training and evaluation. The left intercepts correspond to a network with fully connected layers. (a) MNIST classification using network A. Training and evaluation were repeated for 100 sessions and the mean classification errors were plotted. Errors after 10 epochs are plotted with broken lines and errors after 50 epochs with solid lines. (b) CIFAR-10 classification using network C. Training and evaluation were repeated for 10 sessions and the mean errors were plotted. A diagonal window was used in this task. Errors after 50 epochs are plotted with broken lines and errors after 75 epochs with solid lines.

RELATED WORKS

The neural network was first developed by incorporating biological findings, although structural aspects of the neurons of psychiatric disorder cases had not been introduced into artificial intelligence. This is probably because the neuropathology of psychiatric disorders had not been three-dimensionally delineated (Itokawa et al., 2020) before our reports on the nanometer-scale structure of neurons of schizophrenia cases (Mizutani et al., 2019, 2020). A method called "optimal brain damage" (Le Cun et al., 1990) has been proposed to remove unimportant weights and thereby reduce the number of parameters, although its relation to biological findings, such as those regarding brain injury, has not been explicitly described.

Parameter reduction and network pruning have been suggested as strategies to simplify networks. It has been reported that simultaneous regularization during training can reduce network connections while keeping competitive performance (Scardapane et al., 2017). A method to regularize the network structure, including filter shapes and layer depth, has been reported to allow the network to learn more compact structures without accuracy loss (Wen et al., 2016). A study on network pruning suggested that careful evaluations of structured pruning methods are needed (Liu et al., 2018). Elimination of zero weights after training has been proposed as a method to simplify the network (Yaguchi et al., 2018). Improvements of accuracy have been reported for the regularized networks (Scardapane et al., 2017; Yaguchi et al., 2018), though these parameter reduction methods require dedicated algorithms or procedures to remove parameters during training.

DISCUSSION

Parameter reduction in schizophrenia-mimicking layers can be regarded as an enforced and predefined L1 regularization. Different weight windows gave almost the same results (Figures 2a and 4a), indicating that weight matrix elements are equivalent to each other regardless of their position in the matrix. Training of a schizophrenia-mimicking network requires no modification of the optimization algorithm, since its parameter reduction is arbitrarily configured a priori. Schizophrenia-mimicking layers can be used instead of conventional layers without any changes in the network configuration or training procedures. The outperformance of the schizophrenia connection layer can therefore be incorporated simply by replacing the fully connected layers of any kind of neural network with schizophrenia connection layers.

The structure of the window matrix of the schizophrenia-mimicking layer indicates the importance of connecting all nodes while at the same time dividing them into groups, so that each group can process information both independently and integratively. The results shown in Figure 4 indicate that the performance optimum of the schizophrenia connection layer lies nearer to the division than to the integration. We recommend a 50–70% parameter reduction as a first choice to obtain a presumably best result.
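In terms of the hypothetical `SzDense` sketch above, this first-choice setting would amount to, for example:

```python
# replace: tf.keras.layers.Dense(512, activation='relu')
hidden = SzDense(512, reduction=0.6, activation='relu')  # 50-70% reduction
```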

Neurite thinning in the cerebral tissues of the anterior cingulate cortex and the superior temporal gyrus of schizophrenia cases (Mizutani et al., 2019, 2020) suggested that the connection impairments observed in these brain areas should be related to the psychiatric symptoms of schizophrenia. However, the simulation results of the present study indicated that the connection impairment can improve the performance of the network as a whole. Although meta-analysis studies indicated that schizophrenia is associated with intellectual deficits (Aylward et al., 1984; Khandaker et al., 2011), it has also been reported that creativity and psychosis share common genetic roots (Power et al., 2015) and that geniuses and their families show a higher incidence of psychiatric disorders than the average population (Juda, 1949). The outstanding talents of psychiatric disorder patients are not contradictory to the outperformance of the schizophrenia connection layer. Although the improvement in each network was subtle, the difference should become remarkable if the entire brain operates with the outperformance observed in this study.

The schizophrenia connection layer surpassed the fully connected layer in the developmental model, which assumes brain development under the connection impairment. In contrast, no performance improvements were observed in the disorganized model, which assumes neurodegeneration after the network formation. Crow proposed a classification of schizophrenia cases into two types (Crow, 1980, 1985), in which type I shows positive symptoms with a relatively benign prognosis, while type II shows negative symptoms and progresses irreversibly. We suggest that the positive symptoms of type I can be regarded as hyperactivity of the cerebral cortex due to its outperformance. In contrast, the negative symptoms of type II are ascribable to network malfunction due to excess impairment, observed as the profile surges in Figures 2 and 4.

N-methyl-D-aspartate (NMDA) receptor antagonists, including phencyclidine, cause schizophrenia-like symptoms (Coyle, 2012). Morphological changes of cortical neurons, such as vacuole formation (Olney et al., 1989) and corkscrew deformity of dendrites (Wozniak et al., 1998), have been reported for animal models of NMDA hypofunction. These structural burdens on neurons should suppress input potentials originating from distal dendrites rather than those from proximal dendrites, since cable theory holds that input potentials are attenuated depending on the distance. Therefore, we suggest that the present results are consistent with observations in the NMDA hypofunction models.

The neural networks of this study use only thousands of nodes per model and cannot represent the entire network of the human brain. Therefore, the relation between the connection impairment applied to the schizophrenia-mimicking layer and the white matter dysconnectivity revealed by diffusion tensor imaging (Son et al., 2015) should be investigated further. Another limitation of this study is that the present model cannot provide any suggestion regarding other psychiatric disorders, including autism, since the present study is based only on the structure of schizophrenia brain tissues.

The profiles shown in Figure 4 illustrate that the outperformance lies side by side with the malfunction. The evolutionary process should have scanned, or be scanning, this profile to find the best performance of the brain while not deteriorating its function. The results of this study, along with the abovementioned relationship between creativity and psychosis (Power et al., 2015), suggest that the connection impairment during network development is not a burden to the brain but has some functional roles in the cortical microcircuit performance. We suggest that the connection impairment found in schizophrenia cases (Mizutani et al., 2019, 2020) has been rationally implemented in our brain in the process of human evolution.


CONFLICT OF INTEREST

MA and MI declare a conflict of interest as inventors on several patents regarding the therapeutic use of pyridoxamine for schizophrenia. All other authors declare no conflict of interest.

ACKNOWLEDGEMENTS

This work was supported by Grants-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (nos. 21611009, 25282250, and 25610126), and by the Japan Agency for Medical Research and Development under grant nos. JP18dm0107088, JP19dm0107088h0004, and JP20dm0107088h0005. The structural analyses of human brain tissues have been conducted at the SPring-8 synchrotron radiation facility under proposals of 2011A0034, 2014A1057, 2014B1083, 2015A1160, 2015B1101, 2016B1041, 2017A1143, 2018A1164, 2018B1187, 2019A1207, and 2019B1087, and at the Advanced Photon Source of Argonne National Laboratory under General User Proposals of GUP-45781 and GUP-59766. The studies on brain tissues used resources of the Advanced Photon Source, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357.

REFERENCES

1. Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 65, 386–408, 1958.

2. Brodmann K. Vergleichende Lokalisationslehre der Großhirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues, Leipzig: Johann Ambrosius Barth, 1909.

3. Amunts K, Zilles K. Architectonic mapping of the human brain beyond Brodmann. Neuron 88, 1086–1107, 2015.

4. Hubel DH, Wiesel TN. Receptive fields of single neurones in the cat's striate cortex. J Physiol 148, 574–591, 1959.

5. Fukushima K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36, 193–202, 1980.

6. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556, 2014.

7. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

8. Power RA, Steinberg S, Bjornsdottir G, Rietveld CA, Abdellaoui A, Nivard MM. Polygenic risk scores for schizophrenia and bipolar disorder predict creativity. Nat Neurosci 18, 953–955, 2015.

9. Juda A. The relationship between highest mental capacity and psychic abnormalities. Am J Psychiatry 106, 296–307, 1949.

10. Mizutani R, Saiga R, Takeuchi A, Uesugi K, Terada Y, Suzuki Y, De Andrade V, De Carlo F, Takekoshi S, Inomoto C, Nakamura N, Kushima I, Iritani S, Ozaki N, Ide S, Ikeda K, Oshima K, Itokawa M, Arai M. Three-dimensional alteration of neurites in schizophrenia. Transl Psychiatry 9, 85, 2019.

11. Mizutani R, Saiga R, Yamamoto Y, Uesugi M, Takeuchi A, Uesugi K, Terada Y, Suzuki Y, De Andrade V, De Carlo F, Takekoshi S, Inomoto C, Nakamura N, Kushima I, Iritani S, Ozaki N, Oshima K, Itokawa M, Arai M. Structural diverseness of neurons between brain areas and between cases. arXiv: 2007.00212, 2020.

12. Spruston N. Pyramidal neurons: dendritic structure and synaptic integration. Nat Rev Neurosci 9, 206–221, 2008.

13. Allin M, Murray R. Schizophrenia: a neurodevelopmental or neurodegenerative disorder? Curr Opin Psychiatry 15, 9–15, 2002.

14. Gupta S, Kulhara P. What is schizophrenia: A neurodevelopmental or neurodegenerative disorder or a combination of both? A critical analysis. Indian J Psychiatry 52, 21–27, 2010.

15. Maynard TM, Sikich L, Lieberman JA, LaMantia AS. Neural development, cell-cell signaling, and the "two-hit" hypothesis of schizophrenia. Schizophr Bull 27, 457–476, 2001.

16. Crow TJ. The two-syndrome concept: origins and current status. Schizophr Bull 11, 471–486, 1985.

17. Crow TJ. Molecular pathology of schizophrenia: more than one disease process? Br Med J 280, 66–68, 1980.

18. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324, 1998.

19. Krizhevsky A. Learning multiple layers of features from tiny images. University of Toronto, 2009.

20. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv: 1502.03167, 2015.

21. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. J Machine Learning Res 15, 1929–1958, 2014.

22. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. Proc. Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings 15, 315–323, 2011.

23. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proc. IEEE International Conference on Computer Vision, 2015.

24. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv: 1412.6980, 2014.

25. Tieleman T, Hinton G. Lecture 6.5 - rmsprop. COURSERA: Neural networks for machine learning, 2012.

26. Wong SC, Gatt A, Stamatescu V, McDonnell MD. Understanding data augmentation for classification: when to warp? International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2016.

27. Itokawa M, Oshima K, Arai M, Torii Y, Kushima I, et al. Cutting-edge morphological studies of post-mortem brains of patients with schizophrenia and potential applications of X-ray nanotomography (nano-CT). Psychiatry Clin Neurosci 74, 176–182, 2020.

28. Le Cun Y, Denker JS, Solla SA. Optimal brain damage. Advances in Neural Information Processing Systems 2, 598–605, 1990.

29. Scardapane S, Comminiello D, Hussain A, Uncini A. Group sparse regularization for deep neural networks. Neurocomputing 241, 81–89, 2017.

30. Wen W, Wu C, Wang Y, Chen Y, Li H. Learning structured sparsity in deep neural networks. 30th Conference on Neural Information Processing Systems, 2016.

31. Liu Z, Sun M, Zhou T, Huang G, Darrell T. Rethinking the value of network pruning. arXiv: 1810.05270, 2018.

32. Yaguchi A, Suzuki T, Asano W, Nitta S, Sakata Y, Tanizawa A. Adam induces implicit weight sparsity in rectifier neural networks. 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.

33. Aylward E, Walker E, Bettes B. Intelligence in schizophrenia: meta-analysis of the research. Schizophr Bull 10, 430–459, 1984.

34. Khandaker GM, Barnett JH, White IR, Jones PB. A quantitative meta-analysis of population-based studies of premorbid intelligence and schizophrenia. Schizophr Res 132, 220–227, 2011.


35. Coyle JT. NMDA receptor and schizophrenia: a brief history. Schizophr Bull 38, 920–926, 2012.

36. Olney JW, Labruyere J, Price MT. Pathological changes induced in cerebrocortical neurons by phencyclidine and related drugs. Science 244, 1360–1362, 1989.

37. Wozniak DF, Dikranian K, Ishimaru MJ, Nardi A, Corso TD, Tenkova T, Olney JW, Fix AS. Disseminated corticolimbic neuronal degeneration induced in rat brain by MK-801: potential relevance to Alzheimer's disease. Neurobiol Dis 5, 305–322, 1998.

38. Son S, Kubota M, Miyata J, Fukuyama H, Aso T, Urayama S, Murai T, Takahashi H. Creativity and positive symptoms in schizophrenia revisited: Structural connectivity analysis with diffusion tensor imaging. Schizophr Res 164, 221–226, 2015.

Supplementary Table 1. (a) Configuration of network A. Sz, schizophrenia connection layer. Number-of-parameters column shows number of trainable parameters before the parameter reduction.

Layer | Output size | Number of parameters | Options
Input | 28 × 28 | |
Sz | 512 | 401,920 | Parameter reduction: 0–97.5%
Output | 10 | 5,130 |

Supplementary Table 1. (b) Configuration of network B. FC-Sz, trained as a fully connected layer and evaluated using the schizophrenia window. Number-of-parameters column shows the number of trainable parameters before the parameter reduction. *Dimensions of these hidden layers were set equal to each other and varied to examine the effect of the weight matrix dimension on the connection impairment.

Layer | Output size | Number of parameters | Options
Input | 28 × 28 | |
Fully connected | 64–1024* | 50,240–803,840 |
FC-Sz | 64–1024* | 4,096–1,048,576 | Bias vector was disabled.
Output | 10 | 650–10,250 |


Supplementary Table 1. (c) Configuration of network C. Sz, schizophrenia connection layer; Conv, 2-dimensional convolution layer with kernel size of 3 × 3. Number-of-parameters column shows number of trainable parameters before the parameter reduction.

Layer | Output size | Filter size | Number of parameters | Options
Input | 32 × 32 RGB | | |
Conv | 32 × 32 × 32 | 32 | 896 | Zero padding
Conv | 30 × 30 × 32 | 32 | 9,248 | No padding
Maxpool | 15 × 15 × 32 | | | Pooling 2 × 2, dropout: 25%
Conv | 15 × 15 × 64 | 64 | 18,496 | Zero padding
Conv | 13 × 13 × 64 | 64 | 36,928 | No padding
Maxpool | 6 × 6 × 64 (= 2304) | | | Pooling 2 × 2, dropout: 25%
Sz | 512 | | 1,180,160 | Parameter reduction: 0–95%
Output | 10 | | 5,130 |

Supplementary Table 1. (d) Configuration of network D. FC-Sz, trained as fully connected layer and evaluated using schizophrenia window. Conv, 2-dimensional convolution layer with kernel size of 3 × 3. Number-of-parameters column shows number of trainable parameters before the parameter reduction.

Layer | Output size | Filter size | Number of parameters | Options
Input | 32 × 32 RGB | | |
Conv | 32 × 32 × 32 | 32 | 896 | Zero padding
Conv | 30 × 30 × 32 | 32 | 9,248 | No padding
Maxpool | 15 × 15 × 32 | | | Pooling 2 × 2, dropout: 25%
Conv | 15 × 15 × 64 | 64 | 18,496 | Zero padding
Conv | 13 × 13 × 64 | 64 | 36,928 | No padding
Maxpool | 6 × 6 × 64 (= 2304) | | | Pooling 2 × 2, dropout: 25%
FC-Sz | 512 | | 1,179,648 | Bias vector was disabled.
Output | 10 | | 5,130 |


Supplementary Table 1. (e) Configuration of network E. Conv, 2-dimensional convolution layer; SzConv, 2-dimensional schizophrenia convolution layer; FC, fully connected layer. A kernel size of 3 × 3 was used for all convolution layers. Number-of-parameters column shows the number of trainable parameters before the parameter reduction.

Layer | Output size | Filter size | Number of parameters | Options
Input | 32 × 32 RGB | | |
Conv | 32 × 32 × 64 | 64 | 1,792 | Zero padding
Batch norm. | | | 256 |
Conv | 32 × 32 × 64 | 64 | 36,928 | Zero padding
Maxpool | 16 × 16 × 64 | | | Pooling 2 × 2, dropout: 25%
Conv | 16 × 16 × 128 | 128 | 73,856 | Zero padding
Batch norm. | | | 512 |
Conv | 16 × 16 × 128 | 128 | 147,584 | Zero padding
Maxpool | 8 × 8 × 128 | | | Pooling 2 × 2, dropout: 25%
Conv | 8 × 8 × 256 | 256 | 295,168 | Zero padding
Batch norm. | | | 1,024 |
Conv | 8 × 8 × 256 | 256 | 590,080 | Zero padding
Batch norm. | | | 1,024 |
Conv | 8 × 8 × 256 | 256 | 590,080 | Zero padding
Maxpool | 4 × 4 × 256 | | | Pooling 2 × 2
Conv | 4 × 4 × 512 | 512 | 1,180,160 | Zero padding
Batch norm. | | | 2,048 |
Conv | 4 × 4 × 512 | 512 | 2,359,808 | Zero padding
Batch norm. | | | 2,048 |
Conv | 4 × 4 × 512 | 512 | 2,359,808 | Zero padding
Maxpool | 2 × 2 × 512 | | | Pooling 2 × 2
SzConv | 2 × 2 × 512 | 512 | 2,359,808 | Zero padding; parameter reduction: 0 or 60%
Batch norm. | | | 2,048 |
SzConv | 2 × 2 × 512 | 512 | 2,359,808 | Zero padding; parameter reduction: 0 or 60%
Batch norm. | | | 2,048 |
SzConv | 2 × 2 × 512 | 512 | 2,359,808 | Zero padding; parameter reduction: 0 or 60%
Maxpool | 1 × 1 × 512 | | | Pooling 2 × 2
FC | 4096 | | 2,101,248 |
FC | 4096 | | 16,781,312 |
FC | 1024 | | 4,195,328 |
Output | 10 | | 10,250 |