
Are Neural Nets Modular? Inspecting Their Functionality Through Differentiable Weight Masks

Róbert Csordás 1, Sjoerd van Steenkiste 1, Jürgen Schmidhuber 1 2

Abstract

Neural networks (NNs) whose subnetworks implement reusable functions are expected to offer numerous advantages, e.g., compositionality through efficient recombination of functional building blocks, interpretability, and prevention of catastrophic interference through separation. Understanding if and how NNs are modular could provide insights into how to improve them. Current inspection methods, however, fail to link modules to their function. We present a novel method based on learning binary weight masks to identify individual weights and subnets responsible for specific functions. This powerful tool shows that typical NNs fail to reuse submodules, becoming redundant instead. It also yields new insights into known generalization issues with the SCAN dataset. Our method also unveils class-specific weights of CNN classifiers and shows to which extent classifications depend on them. Our findings open many new important directions for future research.

1. Introduction

Modularity is an important organization principle in both artificial (Baldwin & Clark, 2000; Schmidhuber, 1990) and biological (Clune et al., 2013; von Dassow & Munro, 1999; Lorenz et al., 2011) systems. It provides a natural way of achieving compositionality, which seems essential for systematic generalization, one of the areas where typical artificial neural networks (NNs) do not yet perform well (Fodor et al., 1988; Marcus, 1998; Lake & Baroni, 2018; Hupkes et al., 2019).

Recently, NNs with explicitly designed modules have demonstrated superior generalization capabilities (Chang et al., 2019; Bahdanau et al., 2019; Clune et al., 2013; Kirsch et al., 2018; Andreas et al., 2016).

1 The Swiss AI Lab, IDSIA / USI / SUPSI. 2 NNAISENSE. Correspondence to: Róbert Csordás <[email protected]>.

2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020). Copyright 2020 by the author(s).

An implicit assumption behind such models is that NNs without hand-designed modularity do not become modular by themselves. In contrast, it was recently found that certain types of modular structures emerge in standard NNs (Watanabe, 2019; Filan et al., 2020). However, these methods use proxies such as activation statistics or connectivity to identify modules, making it unclear whether they correspond to functional blocks. For example, functionally consistent modules might contain units with different activation statistics and vice versa, making such analysis inappropriate for inspecting functional modularity. Establishing a better connection between modules and their function remains essential for obtaining a powerful analysis tool that helps to understand and alleviate the shortcomings of current NNs.

Here we analyze functional modularity in common neural architectures: recurrent NNs (RNNs) and feedforward NNs (FNNs), including convolutional NNs (CNNs). We define modules as subnetworks needed to perform a specific function (functional modules). We train an NN on its original task (e.g., classify images into 10 classes), then freeze its weights. Now the original task is modified in line with a specific functionality of interest (e.g., leave one class out and solve the remaining 9). On this modified task, we train probabilistic, binary, but differentiable masks for all weights (while the NN's weights remain frozen). The result is a binary mask for each modified task, exhibiting the subnetwork necessary to perform it.

To investigate whether the discovered functional modules support compositionality, we analyze whether the NN has two desirable properties: (P1) it uses different modules for very different functions, and (P2) it uses the same module for identical functions that may have to be performed multiple times. We experimentally show that many typical NNs exhibit P1 but not P2. By analyzing transfer learning on permuted MNIST, we provide additional insights into this issue. We offer a possible explanation: simple data routing between modules in standard NNs would often be highly desirable, but it is hard to learn through the NN's weight matrices, which typically must also satisfy the conflicting goal of transforming the data in useful ways. Standard NNs have no bias towards separating the conceptually different goals of data transformation and routing.

We show that the modules discovered in typical NNs do not tend to encourage compositional solutions. For example, we analyze encoder-decoder LSTMs on the SCAN dataset (Lake & Baroni, 2018), designed to test systematic generalization based on textual commands. We show that new weights are required to deal with new command combinations, even those combinations governed by already known rules. The new weights are responsible solely for these new combinations, indicating that the learned modules are non-compositional. Instead, they behave much like pattern recognizers and fail to perform the more symbolic manipulation required for generalization on SCAN.

Finally, we study whether functional modules are present in CNNs, which are thought to rely heavily on shared features. Surprisingly, we can identify subsets of weights solely responsible for single classes. When removing such a subset, the performance on its class drops significantly. By analyzing the resulting confusion matrices, we identify classes relying on similar features.

2. Discovering Modular Structure via Weight-Level Introspection

Current methods for discovering modular structure in NNs are based on clustering individual units based on their similarity (Watanabe, 2019; Filan et al., 2020). Such unit-level analysis might be misleading, though. Units can be shared even when their weights, which perform the actual computation, are not. Indeed, units can be viewed as mere "wires" for transmitting information. Imagine a task where operations on numbers should be performed, conditioned on the input. The first hidden layer should represent the number after the first operation, regardless of which operation it was: the units are shared between operations even though the weights computing their values are not. Hence, a weight-level analysis is critical.

The method proposed in this paper inspects pre-trained NNs by applying masks to their frozen weights. Masks are trained on modified tasks depending on the specific functionalities we want to analyze, e.g., sub-tasks of the original problem, or tasks based on different data splits, to test generalization. The trained masks identify subnetworks responsible for the corresponding functionalities.

First, the NN is trained on the original, unmodified task. Once converged, its weights are frozen. Next, we train the mask on the modified task. For each weight, the mask defines a probability of keeping it. All N weights are handled independently of each other. We use i ∈ [1, N] to denote the weight index. The mask's probabilities are represented as logits l_i ∈ R, which are initialized to keep the weights with high probability (90%). To restrict arbitrary scaling of the weights, we binarize masks before applying them. For binarization, we use the Gumbel-Sigmoid with straight-through estimator, which we derive from the Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) in Appendix A. A sample s(l_i) ∈ [0, 1] from the mask can be drawn as:

$$s(l_i) = \sigma\left(\frac{1}{\tau}\left(l_i - \log\frac{\log U_1}{\log U_2}\right)\right), \qquad (1)$$

where U_1, U_2 are i.i.d. samples from the uniform distribution U(0, 1), τ ∈ (0, ∞) is the temperature, and σ(x) = 1/(1 + e^{-x}) is the sigmoid function. Next, we can use a straight-through estimator to obtain a binarized sample:

$$m_i = [\mathbb{1}_{s(l_i) > 0.5} - s(l_i)] + s(l_i), \qquad (2)$$

where \mathbb{1}_x is the indicator function and [x] is an operator for preventing gradient flow. The masks are applied elementwise:

$$w'_i = w_i \cdot m_i. \qquad (3)$$

The goal of the masking is to remove weights that are not required to implement the target function. Thus, the logits l_i should be regularized such that the probability for weight w_i to be active is small unless w_i is necessary for the task. This is achieved by adding the term

$$r = \alpha \sum_i l_i$$

to the loss, where α ∈ [0, ∞) is a hyper-parameter responsible for the strength of the regularization.
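To make the training step concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(3) and the regularizer r. This is our own illustration under the paper's definitions, not the authors' released implementation; all names and the placeholder task loss are ours.

```python
import torch

def sample_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Eq. (1): Gumbel-Sigmoid sample s(l) in [0, 1].
    u1, u2 = torch.rand_like(logits), torch.rand_like(logits)
    s = torch.sigmoid((logits - torch.log(torch.log(u1) / torch.log(u2))) / tau)
    # Eq. (2): straight-through binarization. The forward pass sees a hard
    # 0/1 mask; the backward pass differentiates through s.
    return (s > 0.5).float().detach() - s.detach() + s

def mask_regularizer(logits: torch.Tensor, alpha: float) -> torch.Tensor:
    # Regularization term r = alpha * sum_i l_i, pushing keep-logits down.
    return alpha * logits.sum()

# Eq. (3): elementwise product of the frozen weight and the sampled mask.
# `weight` stays frozen; only `logits` receive gradients.
weight = torch.randn(64, 32, requires_grad=False)
logits = torch.full_like(weight, 2.2, requires_grad=True)  # sigmoid(2.2) ~ 0.9
masked_weight = weight * sample_mask(logits)
loss = masked_weight.abs().mean() + mask_regularizer(logits, alpha=1e-4)  # placeholder task loss
loss.backward()
```

The logit initialization of 2.2 corresponds to the roughly 90% keep probability mentioned above; in practice the task loss would come from running the masked network on the modified task.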

At the end of the training process, deterministic binary masks M_i ∈ {0, 1} are created by thresholding masks at 50% probability: M_i = \mathbb{1}_{\sigma(l_i) > 0.5}. The mask M shows the subnetwork required for the task. Because the probability mass is usually concentrated at 0 and 1, thresholding is safe (see Fig. 6 in Appendix).

We evaluate the resulting subnetwork by applying the mask; the interpretation is task-dependent. For example, to show that a subset of weights is responsible for a particular subtask (A) but not for another (B), apply the mask trained on A and test on both: a performance drop is expected on task B only. Alternatively, if a module is needed exclusively for one subtask but not for the other, we can invert the masks and test on both tasks. The inverted masks are expected to perform well on the complementary task, but not on the original one. However, this mask-inversion method is limited to analyzing entirely disjoint weights. Weight sharing between tasks can be analyzed through the overlap of their corresponding weight masks (see Appendix B.1).
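Continuing the sketch above, the evaluation step might look as follows; `deterministic_mask` is our own name, and `weight`/`logits` are the variables from the previous snippet.

```python
def deterministic_mask(logits: torch.Tensor, invert: bool = False) -> torch.Tensor:
    # M_i = 1 iff sigma(l_i) > 0.5, i.e. iff l_i > 0; inversion keeps the complement.
    m = (logits > 0.0).float()
    return 1.0 - m if invert else m

# Test the subnetwork kept for task A...
w_task_a = weight * deterministic_mask(logits)
# ...and the complementary subnetwork (everything task A does not need).
w_rest = weight * deterministic_mask(logits, invert=True)
```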

The precise experimental details of all our experiments are left to the Appendix. Means and standard deviations shown in the figures are calculated over 10 runs.

Conditions for the validity of the analysis. Choosing the regularization hyperparameter α is critical for drawing valid conclusions. Too low an α might yield the false impression that no modules exist or that they share more weights than they really do. Too strong a regularization may degrade the performance on the target function, discarding essential weights. Fortunately, there is a consistent heuristic for selecting the correct α, which follows from training a set of masks on the full task: we increase α as long as the performance does not start to drop, then reduce α slightly until the performance is adequate. The method is not very sensitive to the exact value of α, and it transfers well between different network sizes (Fig. 7 in Appendix). We find that it is less critical but still important to tune the learning rate and the number of steps of the mask training process. We always check the validity of the selected hyperparameters by training a set of masks on the full, unmodified problem, expecting to see only a slight drop in performance.
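The heuristic can be written as a simple search loop. This is a sketch under our own naming; `train_masks_and_eval` is a hypothetical helper that runs a full mask-training pass on the unmodified task and returns the resulting accuracy.

```python
def select_alpha(train_masks_and_eval, alpha: float = 1e-6,
                 tolerance: float = 0.01, factor: float = 2.0) -> float:
    # Increase alpha while full-task accuracy stays within `tolerance` of the
    # unregularized baseline, then back off one step.
    baseline = train_masks_and_eval(alpha=0.0)
    while train_masks_and_eval(alpha=alpha) >= baseline - tolerance:
        alpha *= factor
    return alpha / factor
```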

Note that underfitting NNs tend to share more weights. Throughout our experiments, we found that a large enough network is essential to avoid false conclusions about the reasons behind the sharing. See Fig. 9 in Appendix for how weight sharing changes with network capacity.

3. Analyzing Module Properties

Let us reconsider P1 and P2 (see introduction) in more detail, as they reflect the advantages of modular compositional design. According to our definition, the NN is not modular without P1. Moreover, disjoint modules prevent catastrophic interference (McCloskey & Cohen, 1989; Rosenbaum et al., 2019) between different functions, because changing the weights of one module does not affect the others. Regarding P2, reuse has multiple advantages. It increases data efficiency to the extent that all relevant pieces are processed by the same module, which thus receives more training. Therefore, it also helps with generalization.

We construct synthetic datasets to verify whether NNs have natural inductive biases supporting P1 and P2. The experiments are kept as simple as possible while targeting the specific property of interest. The inputs of compositional modules can come from multiple sources. Consider arithmetic expressions of the form a ∗ b and (c + d) ∗ e: the first operand of the multiplier can come directly from the input or from the output of the adder. The same holds for the outputs. Consider the four possible combinations: "same input/output (I/O), same function" is standard for training an NN on a single task. Our method is not of interest for this case because there are no sub-tasks to drive mask learning. Much more interesting are: "shared I/O, different function" to inspect property P1 (section 3.1), and "separate I/O, same function" to analyze P2 (section 3.2). We ignore the case "same I/O, different function" because the results of the two experiments above make it unlikely to provide new insights.

Using our tool, we can draw surprising conclusions about the modularity of typical NNs, which tend to satisfy P1 but not P2. Our experiments suggest that weight sharing across tasks is mostly driven by shared I/O rather than task similarity.

Figure 1. Percentage of shared weights on the addition/multiplication task per layer. FNN (top), LSTM (bottom).

3.1. Addition/Multiplication Experiments

The addition/multiplication dataset is designed to test the bias towards separating different functions (P1). It defines a single shared input and output. Two-digit numbers are encoded as one-hot vectors, each representing a digit. Each input is defined by two operands and a specification of the operation (a two-element one-hot vector), resulting in a 42-dimensional input vector. The target is the two-digit modulo-100 result, yielding 20 output elements.
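As an illustration of this encoding, here is our own sketch of how a single example could be built (not the authors' data pipeline): each two-digit number occupies two 10-way one-hot blocks, and the operation flag is a 2-way one-hot block.

```python
import torch
import torch.nn.functional as F

def encode_example(a: int, b: int, op: int):
    """op: 0 for addition, 1 for multiplication. Returns (input, target)."""
    def digits_onehot(n: int) -> torch.Tensor:
        # Two digits, each a 10-way one-hot vector -> 20 elements.
        return F.one_hot(torch.tensor([n // 10, n % 10]), 10).flatten().float()
    x = torch.cat([digits_onehot(a), digits_onehot(b),
                   F.one_hot(torch.tensor(op), 2).float()])  # 20+20+2 = 42 inputs
    result = (a + b if op == 0 else a * b) % 100
    return x, digits_onehot(result)                          # 20 targets

x, y = encode_example(47, 12, op=1)  # 47 * 12 mod 100 = 64
```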

First, we train the network to perform this task without any masking. Once performance is nearly perfect, we freeze its weights. We perform two stages of mask training: first, we train a set of masks on addition (multiplication examples excluded), then we repeat this procedure for multiplication. Note again that mask training cannot change the weights: they can only be kept or removed.

We analyze an FNN and an LSTM (Hochreiter & Schmidhuber, 1997) on this task. For the LSTM, we repeat the full input for a fixed number of timesteps. The result is the output of the final step. No loss is applied at intermediate steps. Regardless of the architecture, we found the same general tendencies: there is more sharing in the input and output layers, less in the hidden layers (Fig. 1). We found that multiplication uses 3.8 times more weights than addition (Fig. 11 in Appendix), explaining the smaller proportion of its weights being shared. We conclude that there appears to be some bias in the network towards implementing different functions using distinct weights. On the other hand, the current amount of separation might still be inadequate to prevent interference and catastrophic forgetting. Increased sharing in the I/O layers could be due to a switching/routing procedure used to select which operation to perform.

Figure 2. Analysis of FNN performance degradation on the addition/multiplication task. Rows show the operation specified in the input (True); columns show the actual operation performed (Predicted). "none" means the predicted number is neither the result of addition nor multiplication. The network ignores the operator specification and performs the one corresponding to the mask instead.

(a) Mask trained on both + and ∗:
True +: 99±0 (+), 1±0 (∗), 0±0 (none)
True ∗: 0±0 (+), 100±0 (∗), 0±0 (none)

(b) Mask trained on +:
True +: 99±0 (+), 1±0 (∗), 0±0 (none)
True ∗: 95±9 (+), 0±0 (∗), 4±9 (none)

(c) Mask trained on ∗:
True +: 1±0 (+), 99±2 (∗), 1±2 (none)
True ∗: 0±0 (+), 100±0 (∗), 0±0 (none)

We also analyzed how performance breaks down on the task for which the mask was not trained (Fig. 2). Here the behavior of the FNN and the RNN differs. The FNN tends to ignore the function description in the input and performs the operation corresponding to what the mask was trained on. The LSTM, on the other hand, tends to produce outputs that are invalid for both operations (Fig. 10 in Appendix), suggesting that it learned a solution where the two operations are more intertwined.

3.2. Double-Addition Experiments

The double-add experiment is designed to test property P2. The task is to perform modulo-100 addition twice by reusing weights. The locations of inputs and outputs differ for the two instances, simulating a more realistic scenario of different data sources within the network when composing modules dynamically, without considering the additional problem of finding the right composition. Since the operation is the same and the data distributions are exact matches, this is the simplest setup that encourages sharing. With inputs a, b, c and d, the network should output a + b and c + d. The encoding is the same as in section 3.1, resulting in 80 input and 40 output units.

We first train the network until convergence on the full task, then freeze its weights. We train a set of masks on a + b, followed by c + d. We analyze both FNN and LSTM, taking care to avoid activation interference: when both operations have to be performed simultaneously, sharing is impossible. Thus, for the FNN, we perform two forward passes. In each pass, we feed only one pair of numbers to the network (either a, b or c, d), while zeroing out the other, as sketched below. With the LSTM, we investigate two different settings. In the first case, both pairs are presented together for a fixed number of steps, and the result is the last output. Hence, the LSTM is allowed to schedule the execution of the operations freely. In the second case, we feed a single pair for multiple steps with the other zeroed out, and then read its output. This procedure is then repeated for both pairs without resetting the state, which removes any incentive for solving them simultaneously, since only one pair is seen at a time.
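For the FNN, the two passes with the unused pair zeroed out might look as follows. This is our own sketch: `model` and the one-hot encoder `enc` (the 20-element digit encoding from section 3.1) are assumed.

```python
import torch

def two_pass_eval(model, enc, a: int, b: int, c: int, d: int):
    zero = torch.zeros(20)  # a zeroed-out two-digit one-hot block
    # Pass 1: only (a, b) visible; read the first result from outputs 0..19.
    out_ab = model(torch.cat([enc(a), enc(b), zero, zero]))[:20]
    # Pass 2: only (c, d) visible; read the second result from outputs 20..39.
    out_cd = model(torch.cat([zero, zero, enc(c), enc(d)]))[20:40]
    return out_ab, out_cd
```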

Figure 3. Double-addition task: accuracy on pair 1 and pair 2 with FNN and with LSTM. The x-axis shows what the applied mask was trained on. ¬ denotes an inverted mask.

The results of all experiments are consistent: weight sharing is low (Fig. 12 in the Appendix). To verify how independent the modules are, we invert masks trained on pair A, removing all weights needed for A, including the shared ones. We test the resulting network on pair B, where the performance decreases only slightly, confirming that the modules are independent (Fig. 3). No difference between the two LSTM variants is apparent (Fig. 13 in Appendix).

These observations show that P2 is violated even in this simple case. Realistic scenarios tend to be more complex, as the data distribution for different instances of the operation might be different (with overlaps), providing even fewer incentives to share. Furthermore, comparing the results to those of section 3.1, it is apparent that sharing depends more on the location of the inputs/outputs than on the similarity of the performed operations. This behavior is undesired and calls for further research.

3.3. A Potential Explanation for Lack of Weight Sharing

Experiments in section 3.2 (and Appendix E) consistently show that the internal parts of NNs resist sharing computation. We hypothesize that the reason is the difficulty of routing in standard NNs. Inputs and outputs must be routed to different sources/targets to reuse modules in different compositions. In routing networks (Rosenbaum et al., 2019; Kirsch et al., 2018; Chang et al., 2019), this is done through hand-designed mechanisms. Without them, routing can only occur through the weights of the NN. But weight matrix-based transformations normally change the data representations unless the weights have a special structure and the units have appropriate activation functions. This structure may be difficult to learn. In fact, our NNs should learn to dynamically change the composition in a problem-dependent way, which is an additional level of complexity. Our experiments suggest that typical NNs find it hard to learn to represent data in the same way on the different paths of information flow and to process them with a single module. Instead, given enough capacity, they prefer to learn a new set of weights for every use case of a function. We argue that this is an important issue that limits generalization and data efficiency. Research on inductive biases is needed to reduce such redundancy regardless of network size and task difficulty.

4. Analyzing Generalization Issues on the SCAN Dataset

The failure of typical NNs at more systematic generalization is one of the central issues of current-day NN research. The SCAN dataset (Lake & Baroni, 2018) is designed for analyzing this property. It consists of compositional commands (e.g., "jump thrice") to be translated into primitive output moves (e.g., "JUMP JUMP JUMP"). The dataset comes in different types of splits. The "simple" split is i.i.d. The "length" split has shorter training samples than test samples, but their structure is identical. Finally, the "add primitive" splits have a particular command presented in the training set but no compositions of this command with others.

It was previously shown that typical NNs have generalization issues on the "length" and some of the "add primitive" splits (Lake & Baroni, 2018). The reason for this is unclear, however. There are two possible explanations. (a) The NN might have learned an algorithm to solve the problem, but the quality of the learned representations is insufficient. For the "length" split, noise might accumulate in the NN's state. For the "add primitive" split, the NN might be unable to form an analogy between the problematic primitive and the well-performing ones because of limited variety in the training data. In either case, the NN has limited pressure to improve, since the learned representations are sufficient to solve the entire training set. (b) Alternatively, the NN might not have learned the correct algorithm for the solution. It might have learned components of it, but it still needs new weights to solve a new problem of the same kind. Only in this case, we argue, has the NN failed to leverage the compositional nature of the problem.

We use the baseline 2-layer LSTM encoder-decoder model from (Lake & Baroni, 2018) (for details refer to Appendix F). We train the model on the "simple" split until convergence and freeze its weights. The learned representations and weights are capable of solving each split, which we verify at the end of the training phase. For each split, we train a different set of masks. We measure the performance of the network on the corresponding test split. The word embeddings are excluded from the masking process and remain unmodified after initial training.

Figure 4. Results of experiments on SCAN. (a) Performance on different splits: accuracy on the split shown on the x-axis, with masks trained on the full problem and with masks trained on that split. (b) Elements removed (dark blue) from the weight masks of the last layer trained on the "Add jump" split. Output 2 corresponds to "JUMP". Broken into 2 lines for better fit. White elements correspond to weights not needed for any of the splits. Most elements corresponding to "JUMP" are removed.

Our experiments support explanation (b). We found a big generalization gap on the "length" and "add jump" splits, whereas "add turn left" caused only a slight degradation (Fig. 4a). The results are consistent with earlier findings (Lake & Baroni, 2018). The crucial additional insight gained with our tool's help comes from the fact that we started with a network whose representations and weights can already solve the full problem. Just by removing weights, the performance degrades on samples excluded from the training set, even though all the tokens and rules are still seen in different combinations. Thus, the solution learned on the full dataset does not embody the correct algorithm, confirming that the root of the problem is explanation (b).

By inspecting the weights, we draw additional conclusions. In the case of the "add jump" split, the most apparent difference is in the output layer. Most weights corresponding to "JUMP" are removed (Fig. 4b). This suggests that the network learned to detect patterns of cases where "JUMP" should be the output, and the last layer pieces them together. In contrast to this pattern-recognition approach, a generalizing algorithm for solving SCAN and other reasoning problems requires a variable manipulation-like solution (Garnelo & Shanahan, 2019; Lake et al., 2017). For the length split, the interpretation is not obvious.


Figure 5. Absolute changes in the normalized confusion matrix on CIFAR 10, when features exclusive to "airplane" are removed [%]. The target class suffers a significant performance drop. Airplanes will be classified as birds or ships, probably because these classes rely heavily on the blue background.

5. Analyzing Convolutional Neural Networks

Findings from section 3.2 suggest that typical neural networks are biased against weight sharing. We investigate this effect in CNNs, which are believed to rely heavily on shared feature extractors (Zeiler & Fergus, 2014). Our method is able to highlight the set of weights solely responsible for individual classes and not shared by any other. First, we train a CNN on CIFAR 10 (details in Appendix G) and freeze its weights. Then we train a set of control masks on all classes. Next, we train a set of masks on all classes except some target class, as sketched below. This setup ensures that weights solely responsible for the target class will be the only ones missing from the resulting masks. We repeat this for all classes. The mask of the last classifier layer is only trained on the full task with all classes; the other stages copy it and keep it unmodified throughout the training process. If masks for the last layer were learned, all units responsible for the target class would be removed from the output projection, dropping the performance on the target class to zero and making it impossible to draw conclusions about the importance of class-exclusive features.
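A minimal sketch of the class-exclusion data split using torchvision; this is our own illustration of the setup, not the authors' training code.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

def cifar10_without_class(target_class: int, root: str = "./data") -> DataLoader:
    # Build the mask-training set: all CIFAR-10 images except `target_class`,
    # so weights used exclusively by that class receive no keep-gradient.
    ds = datasets.CIFAR10(root, train=True, download=True,
                          transform=transforms.ToTensor())
    keep = [i for i, label in enumerate(ds.targets) if label != target_class]
    return DataLoader(Subset(ds, keep), batch_size=128, shuffle=True)

loader = cifar10_without_class(target_class=0)  # exclude "airplane"
```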

We compute the confusion matrix on the full validation set at the end of each training stage. We calculate the absolute difference between the confusion matrices with and without the target class removed, highlighting how the removal changes the classification. Interestingly, although relatively few weights are removed (Fig. 16 in Appendix), the true positive rate of the target class suffers a significant drop of up to 60%. This drop shows a heavy reliance on the class-exclusive, non-shared features found by our method. The drop on non-target classes is insignificant and probably the result of noise introduced by the mask sampling process.

Analyzing the difference in misclassification rates provides insights as well: as the true positive rate drops, some classes are predicted more frequently. These rely on similar shared features. For example, removing "airplane" causes airplane images to be classified as "birds" and "ships" (Fig. 5). Indeed, they have a very distinct shared feature: the blue background. More insightful conclusions can be drawn for other classes (see Fig. 18 in Appendix). Overall, the findings are in line with section 3.2 and Appendix E: the network seems to resist modular weight sharing.

6. Related Work

Among the few attempts to identify modules in NNs, the work of Filan et al. (2020) is the most closely related to ours. It identifies clusters of neurons with strong internal and weak external connectivity. Others detect communities (Watanabe et al., 2018) or cluster units hierarchically based on activation statistics (Watanabe, 2019). All of these methods, however, are unit-based, leaving them vulnerable to the issues discussed in section 2. Mutual information-based methods can help to detect important pathways in NNs (Davis et al., 2020). The above methods, however, neither ground the discovered modules in their functionality nor analyze whether they support compositionality. PackNet (Mallya & Lazebnik, 2018) uses weight masks for continual learning, but its masks are obtained by pruning at the end of the training phase, when the NN must be retrained, and its tasks are learned sequentially, preventing possible co-adaptation of features. Moreover, the method was not used to analyze NNs and their modularity. Numerous papers analyze systematic generalization (Lake & Baroni, 2018; Bahdanau et al., 2019; Hill et al., 2020; Hupkes et al., 2019). However, they leave open the question whether lack of generalization is due to inferior quality of learned representations or to non-compositionality of learned algorithms.

7. Conclusion

Our new method for inspecting modularity in neural networks (NNs) is the first to identify modules by their functionality, making it easier to interpret what NNs really do. It is also a powerful tool for analyzing how the discovered modules of NNs share or separate weights based on functional properties. Unfortunately, it shows that in typical current NNs, weight sharing between modules responsible for solving certain tasks does not really reflect task similarity (as desired) but can mostly be explained by rather trivial shared I/O interfaces of the solution-implementing modules. With disjoint module interfaces, typical RNNs and CNNs exhibit a consistent pattern of failing to share weights. Although functional modules should play an essential role in learning compositional and generalizable solutions, our case study on the SCAN dataset explains the generalization problems of typical NNs by their failure to learn modular, compositional algorithms. We show that these problems are not merely due to insufficient quality of the learned data representations. Our findings open many new directions for future research.


8. Acknowledgements

The authors wish to thank Aleksandar Stanić, Francesco Faccio, Kazuki Irie, Louis Kirsch and the anonymous reviewers for their constructive feedback. We are also grateful to NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and to IBM for donating a Minsky machine. This research was supported by a European Research Council Advanced Grant (no: 742870). We are also thankful to Weights & Biases for making their tool freely accessible to members of academia.

References

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. Systematic generalization: What is required and can it be learned? In International Conference on Learning Representations (ICLR), 2019.

Baldwin, C. Y. and Clark, K. B. Design Rules: The Power of Modularity. MIT Press, 2000.

Chang, M., Gupta, A., Levine, S., and Griffiths, T. L. Automatically composing representation transformations as a means for generalization. In International Conference on Learning Representations (ICLR), 2019.

Clune, J., Mouret, J.-B., and Lipson, H. The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755), Mar 2013. ISSN 1471-2954. doi: 10.1098/rspb.2012.2863.

Davis, B., Bhatt, U., Bhardwaj, K., Marculescu, R., and Moura, J. M. On network science and mutual information for explaining deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8399–8403, 2020.

Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

Filan, D., Hod, S., Wild, C., Critch, A., and Russell, S. Neural networks are surprisingly modular. arXiv preprint arXiv:2003.04881, 2020.

Fodor, J. A., Pylyshyn, Z. W., et al. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.

Garnelo, M. and Shanahan, M. Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Current Opinion in Behavioral Sciences, 29:17–23, 2019.

Golkar, S., Kagan, M., and Cho, K. Continual learning via neural pruning. In NeurIPS 2019 Workshop Neuro AI, 2019.

Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., and Santoro, A. Environmental drivers of systematicity and generalization in a situated agent. In International Conference on Learning Representations (ICLR), 2020.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hupkes, D., Dankers, V., Mul, M., and Bruni, E. The compositionality of neural networks: integrating symbolism and connectionism. arXiv preprint arXiv:1908.08351, 2019.

Jang, E., Gu, S., and Poole, B. Categorical reparametrization with gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Kirsch, L., Kunze, J., and Barber, D. Modular networks: Learning to decompose neural computation. In Advances in Neural Information Processing Systems 31, pp. 2408–2418, 2018.

Kolouri, S., Ketz, N., Zou, X., Krichmar, J., and Pilly, P. Attention-based structural plasticity. arXiv preprint arXiv:1903.06070, 2019.

Lake, B. M. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proc. International Conference on Machine Learning (ICML), volume 80, pp. 2879–2888, 2018.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Lorenz, D. M., Jeng, A., and Deem, M. W. The emergence of modularity in biological systems. Physics of Life Reviews, 8(2):129–160, Jun 2011.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.

Mallya, A. and Lazebnik, S. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7765–7773, Salt Lake City, UT, USA, June 2018.

Marcus, G. F. Rethinking eliminative connectionism. Cognitive Psychology, 37(3):243–282, 1998.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.

Rosenbaum, C., Cases, I., Riemer, M., and Klinger, T. Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774, 2019.

Schmidhuber, J. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, 1990.

von Dassow, G. and Munro, E. Modularity in animal development and evolution: elements of a conceptual framework for evodevo. The Journal of Experimental Zoology, 285(4):307–325, December 1999.

Watanabe, C. Interpreting layered neural networks via hierarchical modular representation. In International Conference on Neural Information Processing, pp. 376–388, 2019.

Watanabe, C., Hiramatsu, K., and Kashino, K. Modular representation of layered neural networks. Neural Networks, 97:62–73, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.


A. From Gumbel-Softmax to Gumbel-Sigmoid

In what follows, we use the notation of Jang et al. (2017): k is the number of categories, class probabilities are π_i, and elements of a sample vector y ∈ R^k from the Gumbel-Softmax distribution (Jang et al., 2017) (also called the Concrete distribution) are denoted by y_i. We will refer to l_i = log π_i as logits. We show how to sample from the Gumbel-Sigmoid distribution, the special case k = 2, l_2 = 0 of the Gumbel-Softmax distribution.

First, we show that the sigmoid is equivalent to the first element y_1 ∈ R of the output vector of a two-element softmax with l_1 = x, x ∈ R, and l_2 = 0:

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} = \frac{e^x}{e^x + e^0} = \frac{e^{l_1}}{e^{l_1} + e^{l_2}} = y_1 \qquad (4)$$

According to Jang et al. (2017), a sample vector y ∈ R^k from the Gumbel-Softmax distribution can be drawn as follows:

$$y_i = \frac{e^{\frac{1}{\tau}(l_i + g_i)}}{\sum_{j=1}^{k} e^{\frac{1}{\tau}(l_j + g_j)}} \qquad (5)$$

where g_i ∼ Gumbel(0, 1) are independent samples from the Gumbel distribution. We are interested in the special case of the sigmoid, which we showed to be equivalent to y_1 in the k = 2, l_2 = 0 case:

$$y_1 = \frac{e^{\frac{1}{\tau}(l_1 + g_1)}}{e^{\frac{1}{\tau}(l_1 + g_1)} + e^{\frac{1}{\tau} g_2}} \qquad (6)$$

This can be rearranged as:

$$y_1 = \frac{1}{1 + e^{-\frac{1}{\tau}(l_1 + g_1 - g_2)}} = \sigma\left(\frac{1}{\tau}(l_1 + g_1 - g_2)\right) \qquad (7)$$

Writing out the inverse transform sampling formula for g_i ∼ Gumbel(0, 1), g_i = −log(−log U_i), where U_i ∼ U(0, 1) are independent samples from the uniform distribution, we get:

$$y_1 = \sigma\left(\frac{1}{\tau}\bigl(l_1 - \log(-\log U_1) + \log(-\log U_2)\bigr)\right) \qquad (8)$$

$$y_1 = \sigma\left(\frac{1}{\tau}\left(l_1 - \log\frac{\log U_1}{\log U_2}\right)\right) \qquad (9)$$

Finally, by renaming s(l) = y_1 and l = l_1 (we have just a single logit), we obtain the sampling formula for the Gumbel-Sigmoid:

$$s(l) = \sigma\left(\frac{1}{\tau}\left(l - \log\frac{\log U_1}{\log U_2}\right)\right) \qquad (10)$$

A.1. Straight-Through Estimator

Samples from the Gumbel-Softmax distribution can be directly converted to a sample from the categorical distribution as:

$$c_i = \mathbb{1}_{i = \arg\max_j y_j} \qquad (11)$$

This gives rise to the straight-through estimator, which provides hard samples while permitting gradient flow ([x] is an operator for blocking gradient flow):

$$c_i = [\mathbb{1}_{i = \arg\max_j y_j} - y_i] + y_i \qquad (12)$$

Since the Gumbel-Sigmoid has k = 2 categories and \sum_i y_i = 1, the argmax can be replaced by checking whether y_i > 0.5:

$$c_i = [\mathbb{1}_{y_i > 0.5} - y_i] + y_i \qquad (13)$$
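Note that the hard sample in Eq. (13) is exactly Bernoulli-distributed with probability σ(l): y_1 > 0.5 holds iff l + g_1 − g_2 > 0, and the difference of two independent Gumbel(0, 1) variables follows a Logistic(0, 1) distribution, so the threshold event has probability σ(l) regardless of τ. A quick Monte-Carlo check (our own sketch):

```python
import torch

l, tau, n = torch.tensor(0.7), 1.0, 200_000
u1, u2 = torch.rand(n), torch.rand(n)
s = torch.sigmoid((l - torch.log(torch.log(u1) / torch.log(u2))) / tau)
hard = (s > 0.5).float()
# Empirical keep-rate matches sigmoid(l); both print ~0.668.
print(hard.mean().item(), torch.sigmoid(l).item())
```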

B. Implementation Details

Our method is implemented in PyTorch and available at https://github.com/xdever/modules. We use the Adam optimizer, a batch size of 128, a learning rate of 10−3, and gradient clipping of 1. To obtain better mask quality, we use 4 mask samples per batch, meaning that each quarter of the batch uses a different mask sample. For non-LSTM networks, we use the ReLU activation function. The Gumbel-Sigmoid always has a temperature of τ = 1.
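The multiple-samples-per-batch detail might be implemented by chunking the batch, each chunk seeing an independently sampled mask. A sketch under our assumptions, reusing the `sample_mask` helper from section 2; `model_fn` is a hypothetical forward function taking the masked weights.

```python
import torch

def forward_with_mask_samples(model_fn, weight, logits, x, n_samples: int = 4):
    # Split the batch into n_samples chunks; each chunk uses its own mask sample.
    outputs = [model_fn(weight * sample_mask(logits), chunk)
               for chunk in x.chunk(n_samples)]
    return torch.cat(outputs)
```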

All figures in this paper show mean and unbiased standarddeviations calculated over 10 runs with different seeds.

For most of our experiments the regularization coefficient αis specified as β = bα, where b is the batch size.

B.1. Calculating the Degree of Weight Sharing

Given a target mask t_i ∈ {0, 1} and a reference mask r_i ∈ {0, 1}, the proportion s(t, r) ∈ [0, 1] of t shared with r is calculated by:

$$s(t, r) = \frac{\sum_i t_i r_i}{\sum_i t_i} \qquad (14)$$

Reference r represents the weights used by other tasks. If there are more than two tasks, r is the logical OR of all of their masks, except for t.
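Eq. (14) and the multi-task reference translate directly to code. This is a sketch under our naming; `masks` is assumed to be a list of binary 0/1 tensors, one per task.

```python
import torch

def sharing_proportion(masks: list, task: int) -> float:
    # Eq. (14): fraction of the task's kept weights also kept by any other task.
    t = masks[task]
    r = torch.zeros_like(t, dtype=torch.bool)
    for j, m in enumerate(masks):   # logical OR of all other tasks' masks
        if j != task:
            r |= m.bool()
    return ((t * r.float()).sum() / t.sum()).item()
```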


Figure 6. Histogram (normalized as a 500-bin PDF) of mask values on different tasks (+/∗ and +/+ for FNN and LSTM). (a) Linear scale; values < 0.0002 (bottom 10% of the first bin) are removed from the calculation because their number vastly exceeds the number of kept weights for most of the networks, making the details invisible. (b) All mask values (without small values removed) on log scale.

Figure 7. Sensitivity analysis for the hyperparameter β = bα (b is the batch size) on the addition/multiplication experiments. Note the logarithmic x-axis. The color shows the amount of total sharing [%]. The red line and star indicate the value chosen for our experiments. Each point is a mean over 10 independent seeds. The network is not very sensitive to the exact choice of β. (a) Big network, with 5 layers of size 2000. (b) Medium network, with 5 layers of size 800. The hyperparameter transfers well between network sizes.

Figure 8. Accuracy on the addition/multiplication task, on + and ∗ with FNN and with LSTM. The x-axis shows what the applied mask was trained on. ¬ denotes an inverted mask.

Figure 9. Addition/multiplication task: the total proportion of shared weights for the "add" operation for different network sizes. "small" means a 4-layer network with hidden sizes of 400, 400, 200; "medium" 5 layers with hidden sizes of 800; "big" 5 layers of 2000; "huge" 5 layers of 4000.

C. Addition/Multiplication Experiments

Since our experiments show that modulo-100 multiplication requires many weights, we use reasonably big networks for this experiment. The FNN is 5 layers deep, each layer having 2000 units. The LSTM has a state size of 256, because further increases cause overfitting. We train the network for 20k steps before freezing. Each mask training phase takes an additional 20k steps. Mask training uses a learning rate of 10−2, 10x higher than the one used for the weights, and a regularizer of β = 10−4. The LSTM uses 3 time steps, where the input is repeated at every step. The dataset is generated by sampling numbers and operations randomly.

Fig. 10 shows that, unlike in the FNN case, if we test the LSTM on the operation opposite to what the mask was trained on, the network performs invalid operations in roughly 70% of the cases. Compare this to the FNN case in Fig. 2.

Fig. 9 shows that while the network with 5 layers and 2000 units is already heavily oversized for the task (the small, 3-layer network with hidden sizes 400, 400, 200 can solve it), sharing still changes with increased network size.

Fig. 8 shows limited separability of the addition and multiplication tasks. Given the high proportion of sharing, especially in the input and output layers, the results are as expected: inverted masks do not perform well.


Figure 10. Analysis of the RNN's performance degradation on the addition/multiplication task. Rows show the operation specified in the input (True); columns show the actual operation performed (Predicted). "none" means the predicted number is neither the result of addition nor multiplication. The network performs invalid operations in most of the cases.

(a) Mask trained on both + and ∗:
True +: 99±0 (+), 1±0 (∗), 0±0 (none)
True ∗: 0±0 (+), 99±0 (∗), 0±0 (none)

(b) Mask trained on +:
True +: 99±0 (+), 1±0 (∗), 0±0 (none)
True ∗: 19±7 (+), 3±1 (∗), 78±6 (none)

(c) Mask trained on ∗:
True +: 1±0 (+), 27±5 (∗), 71±5 (none)
True ∗: 0±0 (+), 100±0 (∗), 0±0 (none)

Figure 11. Addition/multiplication task: number of weights per operation for each layer in (a) feedforward network, (b) LSTM.

Figure 12. Double-addition task: percentage of weights shared per operation in case of (a) feedforward network, (b) LSTM. The first and last layers have zero shared weights.

Figure 13. Double-addition task: accuracy on pair 1 and pair 2 for LSTMs with only one input presented at a time, to prevent interference. The x-axis shows what the applied mask was trained on. ¬ denotes an inverted mask. Compared to Fig. 3, no improvement is apparent.

D. Double Addition Experiments

The training protocol for the double-add experiments is identical to the one in Appendix C, except that the FNN variant uses a mask regularizer of β = 4 · 10−4. The LSTM uses 6 steps in total (3 steps per operation). In the full-input case, both tuples are presented at the input for all 6 steps, and the output is read from the last step. If only one tuple is presented at a time, the first tuple is shown for the first 3 steps, resulting in an output at the 3rd step; the second tuple is presented for the next 3 steps, resulting in an output at the 6th step.

The degree of weight sharing is shown in Fig. 12.

The LSTM with only one input presented at a time is shown in Fig. 13. No improvement is apparent compared to the setting where the LSTM can schedule the operations freely (Fig. 3).

E. Transfer Learning Experiments

We also analyzed the degree to which P2 is violated in a more complex setting (see section 3.2). We measure the amount of transfer in a continual learning setup. A popular benchmark for continual learning is permuted MNIST (Kirkpatrick et al., 2017; Golkar et al., 2019; Kolouri et al., 2019). A sequence of tasks is created by applying task-specific permutations to MNIST images. Spatially close pixels of a given digit may no longer be observed in nearby locations. We sequentially train an FNN on these tasks, switching to a new permutation once adequate performance is reached.

Figure 14. Percentage of weights shared per layer after every second task on permuted MNIST. Each task corresponds to a permutation. The last layer has the lowest capacity, filling up first, forcing subsequent runs to share weights.

Continual learning is closely related to transfer learning. New tasks are expected to reuse knowledge learned from previous tasks, resulting in faster convergence and fewer new parameters. Popular methods for continual learning revolve around freezing necessary weights when a new task is added (Fernando et al., 2017; Mallya & Lazebnik, 2018; Golkar et al., 2019). We adapt our method in a similar spirit. Masks and weights are trained simultaneously. At the end of training on a task, we freeze the used weights and save the masks. The free weights are reinitialized, and a new set of masks is allocated for the next task. We obtain a set of masks, each corresponding to a permutation. This process is reminiscent of PackNet (Mallya & Lazebnik, 2018), with the difference of being single-pass, in contrast to PackNet's three-step process (train, prune, retrain).

Since the tasks differ only by the permutation of the inputs, it would be sufficient to re-train a new first layer of the network, which undoes the permutation and implements the same transformation as the original; the rest of the net could be reused without modification. However, we found only insignificant weight sharing as long as there is sufficient free capacity available (Fig. 14). Only once all the capacity is filled up do weights start being shared. This is especially apparent in the last layer of the network.

The initialization of new masks at the beginning of the training phase partially destroys knowledge about the previous stage's connectivity, which we suspect might be the reason for failing to share weights. To test this hypothesis, we initialize elements of new masks corresponding to used weights with a significantly higher probability than those corresponding to new weights. Intuitively, this favours the old, frozen network and adds new weights with low probability. The mask logits corresponding to weights of the previous task are initialized to 2 (corresponding to P ≈ 0.88), the logits for newly initialized weights to either 0 (P = 0.5, Fig. 15a) or −1 (P ≈ 0.27, Fig. 15b). Although this improved the sharing, it is still far from the target 100%. It also increased the sharing of the first layer, which should not share at all.

Figure 15. Percentage of weights shared per layer after every second task on permuted MNIST, for a network with masks initialized to prefer reuse of old weights. Each task corresponds to a permutation. The last layer has the lowest capacity, filling up first, forcing the subsequent runs to share weights. Decreasing the probability of new weights forces increased sharing, but it is far from the ideal 100%. Moreover, the sharing in the first, permuted layer also increases, which should not be the case. (a) New weights with P = 0.5. (b) New weights with P ≈ 0.27.
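The initialization can be sketched as follows. Names are ours; `used` is assumed to be the boolean OR of all previous tasks' deterministic masks.

```python
import torch

def init_new_mask_logits(used: torch.Tensor, new_logit: float = 0.0) -> torch.Tensor:
    # Previous-task weights start at logit 2 (P ~ 0.88); fresh weights start at
    # `new_logit`: 0.0 gives P = 0.5, -1.0 gives P ~ 0.27.
    logits = torch.where(used.bool(),
                         torch.full_like(used, 2.0, dtype=torch.float),
                         torch.full_like(used, new_logit, dtype=torch.float))
    return logits.requires_grad_()
```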

Such observations confirm once more that P2 is not achieved: modules are not reused. The same functionality is redundantly re-learned instead.

E.1. Implementation Details

In the transfer learning setup, we train 10 permutations of MNIST with the same network. These experiments use 8 mask samples per batch. Each phase takes 30k steps. The learning rate is 10⁻². The network is 4 layers deep, with hidden sizes of 800, 800, and 64. We use a mask regularizer of α = 10⁻⁵.

F. SCAN Experiments

We modify the baseline (Lake & Baroni, 2018) by reducing the size of the word embeddings to 16, because SCAN has only 13 input and 6 output tokens. We find that the old, full-size word embeddings are very redundant. So there are many possible input-to-hidden weight configurations, greatly decreasing the probability of sampling any particular one, which causes the input-to-hidden layer to be removed by the thresholding procedure. The reduced embedding does not suffer from such problems.



Figure 16. The percentage of weights removed from a net trained on CIFAR10: (a) per class, (b) per layer. The output layer is excluded from the masking process, thus no weights are removed from it.


The training procedure uses a batch size of 256 and gradient clipping of 5. The mask regularizer is β = 3 · 10⁻⁵, and the mask learning rate is 10⁻². We train the network for 25k steps without masks, and for another 25k steps for each set of masks.
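One common way to realize such probabilistic, binary, yet differentiable masks is a straight-through sigmoid estimator combined with a sparsity penalty weighted by β. The following is a rough sketch of that idea under our own simplifications, not necessarily the exact estimator used here:

import torch

def sample_mask(logits):
    # Forward pass: hard binary sample; backward pass: sigmoid gradient
    # (straight-through estimator).
    p = torch.sigmoid(logits)
    b = torch.bernoulli(p)
    return b + p - p.detach()

def total_loss(task_loss, all_mask_logits, beta=3e-5):
    # The regularizer penalizes the expected number of kept weights,
    # pushing the masks to drop weights the task does not need.
    reg = sum(torch.sigmoid(l).sum() for l in all_mask_logits)
    return task_loss + beta * reg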

G. CNN Experiments on CIFAR10

G.1. Detailed Description

We use a learning rate of 10⁻³ and a mask regularizer of β = 10⁻⁴. We pre-train the network for 20k steps, followed by 20k steps for each of the masks, including the reference. See Table 1 for details of the architecture.

G.2. Analysis of All of the Classes

Fig. 18 shows the confusion matrix difference for all classes in CIFAR10. The most surprising observation is that the decrease in performance for each of the classes is substantial, ranging from 40 to 60%. This shows the heavy reliance on class-exclusive features. The percentage of removed class-specific weights can be seen in Fig. 16.
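The plotted quantity is a simple difference of row-normalized confusion matrices; a sketch (assuming arrays of labels and predictions are available; all names are ours):

import numpy as np

def confusion(y_true, y_pred, num_classes=10):
    m = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return 100.0 * m / m.sum(axis=1, keepdims=True)  # row-wise percentages

# Change caused by removing one class's exclusive weights:
# diff = confusion(y_true, preds_masked) - confusion(y_true, preds_reference)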

Analyzing confusion matrix differences yields interesting insights. "Airplane" is confused with "bird" and "ship", probably because of the blue background. Classes "cat" and "dog" tend to be confused with each other: removing exclusive feature detectors for one improves the performance of the other. "Truck" and "car" are highly related, probably because of similar shapes, such as tires, and similar backgrounds, such as the road.

Figure 17. Confusion matrix on CIFAR10 with masks trained on all classes (rows: true label, columns: predicted label). Diagonal accuracies range from 63 ± 3% ("cat") to 90% ("automobile" and "ship").




Figure 18. The change in the confusion matrix for all CIFAR10 classes when the class indicated by the panel caption is removed: (a) airplane, (b) automobile, (c) bird, (d) cat, (e) deer, (f) dog, (g) frog, (h) horse, (i) ship, (j) truck. Rows: true label, columns: predicted label.



Table 1. CNN architecture used for the CIFAR10 experiments.

Index  Operation        Inputs  Outputs  Kernel  Padding  Activation  Dropout
1      Conv             3       32       3x3     1        ReLU        -
2      Max pooling      32      32       2x2     0        -           -
3      Conv             32      64       3x3     1        ReLU        -
4      Max pooling      64      64       2x2     0        -           -
5      Conv             64      128      3x3     1        ReLU        0.25
6      Max pooling      128     128      2x2     0        -           -
7      Conv             128     256      3x3     1        ReLU        0.5
8      Spatial average  256     256      -       -        -           -
9      Feedforward      256     10       -       -        Softmax     -
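For reference, one possible PyTorch rendering of Table 1 (our reading of the table; the weight-masking machinery is omitted, and the exact placement of dropout relative to pooling is our assumption):

import torch
import torch.nn as nn

class Cifar10Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),    # 1
            nn.MaxPool2d(2),                                          # 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),   # 3
            nn.MaxPool2d(2),                                          # 4
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),  # 5
            nn.Dropout(0.25),
            nn.MaxPool2d(2),                                          # 6
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), # 7
            nn.Dropout(0.5),
            nn.AdaptiveAvgPool2d(1),                                  # 8: spatial average
        )
        self.classifier = nn.Linear(256, num_classes)                 # 9: softmax applied in the loss

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))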