published as a conference paper at iclr 2020...published as a conference paper at iclr 2020 we make...
TRANSCRIPT
-
Published as a conference paper at ICLR 2020
THE EARLY PHASE OF NEURAL NETWORK TRAINING
Jonathan Frankle†MIT CSAIL
David J. SchwabCUNY ITSFacebook AI Research
Ari S. MorcosFacebook AI Research
ABSTRACT
Recent studies have shown that many important aspects of neural network learn-ing take place within the very earliest iterations or epochs of training. For exam-ple, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descentmoves into a small subspace (Gur-Ari et al., 2018), and the network undergoesa critical period (Achille et al., 2019). Here we examine the changes that deepneural networks undergo during this early phase of training. We perform exten-sive measurements of the network state during these early iterations of trainingand leverage the framework of Frankle et al. (2019) to quantitatively probe theweight distribution and its reliance on various aspects of the dataset. We findthat, within this framework, deep networks are not robust to reinitializing withrandom weights while maintaining signs, and that weight distributions are highlynon-independent even after only a few hundred iterations. Despite this behavior,pre-training with blurred inputs or an auxiliary self-supervised task can approx-imate the changes in supervised networks, suggesting that these changes are notinherently label-dependent, though labels significantly accelerate this process. To-gether, these results help to elucidate the network changes occurring during thispivotal initial period of learning.
1 INTRODUCTION
Over the past decade, methods for successfully training big, deep neural networks have revolution-ized machine learning. Yet surprisingly, the underlying reasons for the success of these approachesremain poorly understood, despite remarkable empirical performance (Santurkar et al., 2018; Zhanget al., 2017). A large body of work has focused on understanding what happens during the laterstages of training (Neyshabur et al., 2019; Yaida, 2019; Chaudhuri & Soatto, 2017; Wei & Schwab,2019), while the initial phase has been less explored. However, a number of distinct observationsindicate that significant and consequential changes are occurring during the most early stage of train-ing. These include the presence of critical periods during training (Achille et al., 2019), the dramaticreshaping of the local loss landscape (Sagun et al., 2017; Gur-Ari et al., 2018), and the necessity ofrewinding in the context of the lottery ticket hypothesis (Frankle et al., 2019). Here we perform athorough investigation of the state of the network in this early stage.
To provide a unified framework for understanding the changes the network undergoes during theearly phase, we employ the methodology of iterative magnitude pruning with rewinding (IMP), asdetailed below, throughout the bulk of this work (Frankle & Carbin, 2019; Frankle et al., 2019). Theinitial lottery ticket hypothesis, which was validated on comparatively small networks, proposed thatsmall, sparse sub-networks found via pruning of converged larger models could be trained to highperformance provided they were initialized with the same values used in the training of the unprunedmodel (Frankle & Carbin, 2019). However, follow-up work found that rewinding the weights totheir values at some iteration early in the training of the unpruned model, rather than to their initialvalues, was necessary to achieve good performance on deeper networks such as ResNets (Frankleet al., 2019). This observation suggests that the changes in the network during this initial phaseare vital for the success of the training of small, sparse sub-networks. As a result, this paradigmprovides a simple and quantitative scheme for measuring the importance of the weights at variouspoints early in training within an actionable and causal framework.
†Work done while an intern at Facebook AI Research.
1
-
Published as a conference paper at ICLR 2020
We make the following contributions, all evaluated across three different network architectures:
1. We provide an in-depth overview of various statistics summarizing learning over the earlypart of training.
2. We evaluate the impact of perturbing the state of the network in various ways during theearly phase of training, finding that:
(i) counter to observations in smaller networks (Zhou et al., 2019), deeper networks arenot robust to reinitializion with random weights, but maintained signs
(ii) the distribution of weights after the early phase of training is already highly non-i.i.d.,as permuting them dramatically harms performance, even when signs are maintained
(iii) both of the above perturbations can roughly be approximated by simply adding noiseto the network weights, though this effect is stronger for (ii) than (i)
3. We measure the data-dependence of the early phase of training, finding that pre-trainingusing only p(x) can approximate the changes that occur in the early phase of training,though pre-training must last for far longer (∼32× longer) and not be fed misleading labels.
2 KNOWN PHENOMENA IN THE EARLY PHASE OF TRAINING
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
82
84
86
88
90
92
Eval
Accu
racy
(%)
Resnet-20 (CIFAR-10)
Rewind to 0Rewind to 100Rewind to 250Rewind to 500Rewind to 1000Rewind to 2000
Figure 1: Accuracy of IMP when rewinding tovarious iterations of the early phase for ResNet-20 sub-networks as a function of sparsity level.
Lottery ticket rewinding: The original lot-tery ticket paper (Frankle & Carbin, 2019) re-wound weights to initialization, i.e., k = 0,during IMP. Follow up work on larger modelsdemonstrated that it is necessary to rewind to alater point during training for IMP to succeed,i.e., k
-
Published as a conference paper at ICLR 2020
Training iterations
1-10 10-500 500-2000 2000+
Gradient magnitudes are very large
Rapid motion in weight space
Large sign changes (3% in �rst 10 it)
Gradients converge to roughly constant magnitude
Weight magnitudes increase linearly
Performance increase decelerates, but is still rapidRewinding begins to become highly e�ective
Gradient magnitudes remain roughly constant
Weight magnitude increase slows
Performance slowly asymptotesBene�t of rewinding saturates
Gradient magnitudes reach a minimum before stabilizing 10% higher
Motion in weight space slows, but is still rapid
Performance increases rapidly, reaching over 50% eval accuracy
Figure 2: Rough timeline of the early phase of training for ResNet-20 on CIFAR-10.
et al., 2015), the WRN-16-8 wide residual network (Zagoruyko & Komodakis, 2016), and the VGG-13 network (Simonyan & Zisserman (2015) as adapted by Liu et al. (2019)). Throughout the mainbody of the paper, we show ResNet-20; in Appendix B, we present the same experiments for theother networks. Unless otherwise stated, results were qualitatively similar across all three networks.All experiments in this paper display the mean and standard deviation across five replicates withdifferent random seeds. See Appendix A for further model details.
Iterative magnitude pruning with rewinding: In order to test the effect of various hypothesesabout the state of sparse networks early in training, we use the Iterative Magnitude Pruning withrewinding (IMP) procedure of Frankle et al. (2019) to extract sub-networks from various points intraining that could have learned on their own. The procedure involves training a network to com-pletion, pruning the 20% of weights with the lowest magnitudes globally throughout the network,and rewinding the remaining weights to their values from an earlier iteration k during the initial,pre-pruning training run. This process is iterated to produce networks with high sparsity levels. Asdemonstrated in Frankle et al. (2019), IMP with rewinding leads to sparse sub-networks which cantrain to high performance even at high sparsity levels > 90%.
Figure 1 shows the results of the IMP with rewinding procedure, showing the accuracy of ResNet-20 at increasing sparsity when performing this procedure for several rewinding values of k. Fork ≥ 500, sub-networks can match the performance of the original network with 16.8% of weightsremaining. For k > 2000, essentially no further improvement is observed (not shown).
4 THE STATE OF THE NETWORK EARLY IN TRAINING
Many of the aforementioned papers refer to various points in the “early” part of training. In thissection, we descriptively chart the state of ResNet-20 during the earliest phase of training to providecontext for this related work and our subsequent experiments. We specifically focus on the first4,000 iterations (10 epochs). See Figure A3 for the characterization of additional networks. Weinclude a summary of these results for ResNet-20 as a timeline in Figure 2, and include a broadertimeline including results from several previous papers for ResNet-18 in Figure A1.
As shown in Figure 3, during the earliest ten iterations, the network undergoes substantial change. Itexperiences large gradients that correspond to a rapid increase in distance from the initialization anda large number of sign changes of the weights. After these initial iterations, gradient magnitudesdrop and the rate of change in each of the aforementioned quantities gradually slows through theremainder of the period we observe. Interestingly, gradient magnitudes reach a minimum after thefirst 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy,improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway tothe final 91.5%. By 2000 iterations, accuracy approaches 80%.
During the first 4000 iterations of training, we observe three sub-phases. In the first phase, lastingonly the initial few iterations, gradient magnitudes are very large and, consequently, the networkchanges rapidly. In the second phase, lasting about 500 iterations, performance quickly improves,weight magnitudes quickly increase, sign differences from initialization quickly increase, and gra-dient magnitudes reach a minimum before settling at a stable level. Finally, in the third phase, all ofthese quantities continue to change in the same direction, but begin to decelerate.
3
-
Published as a conference paper at ICLR 2020
0 1000 2000 3000 4000Training Iteration
0.2
0.4
0.6
0.8
Eval
Accu
racy
(%)
Eval Accuracy
0 1000 2000 3000 4000Training Iteration
0
5
10
15
20
25
% o
f Sig
ns th
at Di
ffer f
rom
Init Sign Differences
0 1000 2000 3000 4000Training Iteration
0.058
0.060
0.062
0.064
0.066
0.068
Mag
nitu
de
Average Weight Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.20
0.15
0.10
0.05
0.00
0.05
0.10
Mag
nitu
de
Weight Trace
0 1000 2000 3000 4000Training Iteration
0.5
1.0
1.5
2.0
2.5
Eval
Loss
Eval Loss
0 1000 2000 3000 4000Training Iteration
0.160.180.200.220.240.260.280.300.32
Mag
nitu
de
Gradient Magnitude
0 1000 2000 3000 4000Training Iteration
0102030405060
L2 D
istan
ce
L2 Dist
Init Final
0 1000 2000 3000 4000Training Iteration
0.20.30.40.50.60.70.80.91.0
Cosin
e Sim
ilarit
y
Cosine Similarity
Init Final
Figure 3: Basic telemetry about the state of ResNet-20 during the first 4000 iterations (10 epochs).Top row: evaluation accuracy/loss; average weight magnitude; percentage of weights that changesign from initialization; the values of ten randomly-selected weights. Bottom row: gradient magni-tude; L2 distance of weights from their initial values and final values at the end of training; cosinesimilarity of weights from their initial values and final values at the end of training.
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
init: 0 signs: 0init: 500 signs: 500random init signs: 500init: 0 signs: 500init: 500 signs: 0
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92Ev
al Ac
cura
cy (%
)
init: 0 signs: 0init: 2000 signs: 2000random init signs: 2000init: 0 signs: 2000init: 2000 signs: 0
Figure 4: Performance of an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized to thesigns at iteration 0 or k and the magnitudes at iteration 0 or k. Left: k = 500. Right: k = 2000.
5 PERTURBING NEURAL NETWORKS EARLY IN TRAINING
Figure 1 shows that the changes in the network weights over the first 500 iterations of training areessential to enable high performance at high sparsity levels. What features of this weight transfor-mation are necessary to recover increased performance? Can they be summarized by maintainingthe weight signs, but discarding their magnitudes as implied by Zhou et al. (2019)? Can they berepresented distributionally? In this section, we evaluate these questions by perturbing the earlystate of the network in various ways. Concretely, we either add noise or shuffle the weights ofIMP sub-networks of ResNet-20 across different network sub-compenents and examine the effecton the network’s ability to learn thereafter. The sub-networks derived by IMP with rewinding makeit possible to understand the causal impact of perturbations on sub-networks that are as capable asthe full networks but more visibly decline in performance when improperly configured. To enablecomparisons between the experiments in Section 5 and provide a common frame of reference, wemeasure the effective standard deviation of each perturbation, i.e. stddev(wperturb − worig).
5.1 ARE SIGNS ALL YOU NEED?
Zhou et al. (2019) show that, for a set of small convolutional networks, signs alone are sufficientto capture the state of lottery ticket sub-networks. However, it is unclear whether signs are stillsufficient for larger networks early in training. In Figure 4, we investigate the impact of combiningthe magnitudes of the weights from one time-point with the signs from another. We found that thesigns at iteration 500 paired with the magnitudes from initialization (red line) or from a separaterandom initialization (green line) were insufficient to maintain the performance reached by usingboth signs and magnitudes from iteration 500 (orange line), and performance drops to that of using
4
-
Published as a conference paper at ICLR 2020
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Resnet-20 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 500Shuffle GloballyShuffle LayersShuffle Filters
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Rewind to 0Rewind to 2000Shuffle GloballyShuffle LayersShuffle Filters
Figure 5: Performance of an IMP-derived ResNet-20 sub-network on CIFAR-10 initialized with theweights at iteration k permuted within various structural elements. Left: k = 500. Right: k = 2000.
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Rewind to 0Rewind to 500Shuffle Globally, Signs at 500Shuffle Layers, Signs at 500Shuffle Filters, Signs at 500Shuffle Layers, Signs at Init
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Rewind to 0Rewind to 2000Shuffle Globally, Signs at 2000Shuffle Layers, Signs at 2000Shuffle Filters, Signs at 2000Shuffle Layers, Signs at Init
Figure 6: The effect of training an IMP-derived sub-network of ResNet-20 on CIFAR-10 initializedwith the weights at iteration k as shuffled within various structural elements where shuffling onlyoccurs between weights with the same sign. Left: k = 500. Right: k = 2000.
both magnitudes and signs from initialization (blue line). However, while using the magnitudesfrom iteration 500 and the signs from initialization, performance is still substantially better thaninitialization signs and magnitudes. In addition, the overall perturbation to the network by using themagnitudes at iteration 500 and signs from initialization (mean: 0.0, stddev: 0.033) is smaller thanby using the signs at iteration 500 and the magnitudes from initialization (0.0± 0.042, mean ± std).These results suggest that the change in weight magnitudes over the first 500 iterations of trainingare substantially more important than the change in the signs for enabling subsequent training.
By iteration 2000, however, pairing the iteration 2000 signs with magnitudes from initialization (redline) reaches similar performance to using the signs from initialization and the magnitudes fromiteration 2000 (purple line) though not as high performance as using both from iteration 2000. Thisresult suggests that network signs undergo important changes between iterations 500 and 2000, asonly 9% of signs change during this period. Our results also suggest that counter to the observationsof Zhou et al. (2019) in shallow networks, signs are not sufficient in deeper networks.
5.2 ARE WEIGHT DISTRIBUTIONS I.I.D.?
Can the changes in weights over the first k iterations be approximated distributionally? To mea-sure this, we permuted the weights at iteration k within various structural sub-components of thenetwork (globally, within layers, and within convolutional filters). If networks are robust to thesepermutations, it would suggest that the weights in such sub-compenents might be approximated andsampled from. As Figure 5 shows, however, we found that performance was not robust to shufflingweights globally (green line) or within layers (red line), and drops substantially to no better thanthat of the original initialization (blue line) at both 500 and 2000 iterations.1 Shuffling within filters(purple line) performs slightly better, but results in a smaller overall perturbation (0.0 ± 0.092 fork = 500) than shuffling layerwise (0.0±0.143) or globally (0.0±0.144), suggesting that this changein perturbation strength may simply account for the difference.
1We also considered shuffling within the incoming and outgoing weights for each neuron, but performancewas equivalent to shuffling within layers. We elided these lines for readability.
5
-
Published as a conference paper at ICLR 2020
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Rewind to 0Rewind to 5000.51.02.03.0
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Rewind to 0Rewind to 20000.51.02.03.0
Figure 7: The effect of training an IMP-derived sub-network of ResNet-20 on CIFAR-10 initializedwith the weights at iteration k and Gaussian noise of nσ, where σ is the standard deviation of theinitialization distribution for each layer. Left: k = 500. Right: k = 2000.
Are the signs from the rewinding iteration, k, sufficient to recover the damage caused by permuta-tion? In Figure 6, we also consider shuffling only amongst weights that have the same sign. Doing sosubstantially improves the performance of the filter-wise shuffle; however, it also reduces the extentof the overall perturbation (0.0± 0.049 for k = 500). It also improves the performance of shufflingwithin layers slightly for k = 500 and substantially for k = 2000. We attribute the behavior fork = 2000 to the signs just as in Figure 4: when the magnitudes are similar in value (Figure 4 redline) or distribution (Figure 6 red and green lines), using the signs improves performance. Revertingback to the initial signs while shuffling magnitudes within layers (brown line), however, damagesthe network too severely (0.0 ± 0.087 for k = 500) to yield any performance improvement overrandom noise. These results suggest that, while the signs from initialization are not sufficient forhigh performance at high sparsity as shown in Section 5.1, the signs from the rewinding iterationare sufficient to recover the damage caused by permutation, at least to some extent.
5.3 IS IT ALL JUST NOISE?
Some of our previous results suggested that the impact of signs and permutations may simply reduceto adding noise to the weights. To evaluate this hypothesis, we next study the effect of simplyadding Gaussian noise to the network weights at iteration k. To add noise appropriately for layerswith different scales, the standard deviation of the noise added for each layer was normalized to amultiple of the standard deviation σ of the initialization distribution for that layer. In Figure 7, wesee that for iteration k = 500, sub-networks can tolerate 0.5σ to 1σ of noise before performancedegrades back to that of the original initialization at higher levels of noise. For iteration k = 2000,networks are surprisingly robust to noise up to 1σ, and even 2σ exhibits nontrivial performance.
In Figure 8, we plot the performance of each network at a fixed sparsity level as a function of theeffective standard deviation of the noise imposed by each of the aforementioned perturbations. Wefind that the standard deviation of the effective noise explained fairly well the resultant performance(k = 500: r = −0.672, p = 0.008; k = 2000: r = −0.726, p = 0.003). As expected, perturbationsthat preserved the performance of the network generally resulted in smaller changes to the state ofthe network at iteration k. Interestingly, experiments that mixed signs and magnitudes from differentpoints in training (green points) aligned least well with this pattern: the standard deviation of theperturbation is roughly similar among all of these experiments, but the accuracy of the resultingnetworks changes substantially. This result suggests that although the standard deviation of thenoise is certainly indicative of lower accuracy, there are still specific perturbations that, while smallin overall magnitude, can have a large effect on the network’s ability to learn, suggesting that theobserved perturbation effects are not, in fact, just a consequence of noise.
6 THE DATA-DEPENDENCE OF NEURAL NETWORKS EARLY IN TRAINING
Section 5 suggests that the change in network behavior by iteration k is not due to easily-ascertainable, distributional properties of the network weights and signs. Rather, it appears thattraining is required to reach these network states. It is unclear, however, the extent to which variousaspects of the data distribution are necessary. Mainly, is the change in weights during the earlyphase of training dependent on p(x) or p(y|x)? Here, we attempt to answer this question by mea-
6
-
Published as a conference paper at ICLR 2020
0.00 0.05 0.10 0.15 0.20 0.25 0.30Stddev of Effective Noise vs. Rewind to 500
88.5
89.0
89.5
90.0
90.5
91.0
Eval
Accu
racy
Rewind to 500Gaussian 0.5
Gaussian 1.0
Gaussian 2.0Gaussian 3.0random init signs: 500
init: 0 signs: 500
init: 500 signs: 0
Shuffle Globally
Shuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 500Shuffle Layers, Signs at 500
Shuffle Filters, Signs at 500
16.8% Sparsity
0.00 0.05 0.10 0.15 0.20 0.25 0.30Stddev of Effective Noise vs. Rewind to 2000
88.5
89.0
89.5
90.0
90.5
91.0
91.5
Eval
Accu
racy
Rewind to 2000 Gaussian 0.5Gaussian 1.0
Gaussian 2.0
Gaussian 3.0random init signs: 2000
init: 0 signs: 2000
init: 2000 signs: 0
Shuffle Globally Shuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 2000
Shuffle Layers, Signs at 2000
Shuffle Filters, Signs at 2000
16.8% Sparsity
Figure 8: The effective standard deviation of various perturbations as a function of mean evaluationaccuracy (across 5 seeds) at sparsity 26.2%. The mean of each perturbation was approximately 0.Left: k = 500, r = −0.672, p = 0.008; Right: k = 2000, r = −0.726, p = 0.003.
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
75
80
85
90
Eval
Accu
racy
(%)
Random Labels
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
84
86
88
90
92
Eval
Accu
racy
(%)
Self-Supervised Rotation
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
82
84
86
88
90
92
Eval
Accu
racy
(%)
4x Blurring
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
82
84
86
88
90
92
Eval
Accu
racy
(%)
4x Blurring + Self-Superivsed Rotation
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
Figure 9: The effect of pre-training ResNet-20 on CIFAR-10 with random labels, self-supervisedrotation, 4x blurring, and 4x blurring and self-supervised rotation.
suring the extent to which we can re-create a favorable network state for sub-network training usingrestricted information from the training data and labels. In particular, we consider pre-training thenetwork with techniques that ignore labels entirely (self-supervised rotation prediction, Section 6.2),provide misleading labels (training with random labels, Section 6.1), or eliminate information fromexamples (blurring training examples Section 6.3).
We first train a randomly-initialized, unpruned network on CIFAR-10 on the pre-training task for aset number of epochs. After pre-training, we train the network normally as if the pre-trained statewere the original initialization. We then use the state of the network at the end of the pre-trainingphase as the “initialization” to find masks for IMP. Finally, we examine the performance of the IMP-pruned sub-networks as initialized using the state after pre-training. This experiment determinesthe extent to which pre-training places the network in a state suitable for sub-network training ascompared to using the state of the network at iteration k of training on the original task.
7
-
Published as a conference paper at ICLR 2020
6.1 RANDOM LABELS
To evaluate whether this phase of training is dependent on underlying structure in the data, we drewinspiration from Zhang et al. (2017) and pre-trained networks on data with randomized labels. Thisexperiment tests whether the input distribution of the training data is sufficient to put the networkin a position from which IMP with rewinding can find a sparse, trainable sub-network despite thepresence of incorrect (not just missing) labels. Figure 9 (upper left) shows that pre-training onrandom labels for up to 10 epochs provides no improvement above rewinding to iteration 0 and thatpre-training for longer begins to hurt accuracy. This result suggests that, though it is still possiblethat labels may not be required for learning, the presence incorrect labels is sufficient to preventlearning which approximates the early phase of training.
6.2 SELF-SUPERVISED ROTATION PREDICTION
What if we remove labels entirely? Is p(x) sufficient to approximate the early phase of training?Historically, neural network training often involved two steps: a self-supervised pre-training phasefollowed by a supervised phase on the target task (Erhan et al., 2010). Here, we consider one suchself-supervised technique: rotation prediction (Gidaris et al., 2018). During the pre-training phase,the network is presented with a training image that has randomly been rotated 90n degrees (wheren ∈ {0, 1, 2, 3}). The network must classify examples by the value of n. If self-supervised pre-training can approximate the early phase of training, it would suggest that p(x) is sufficient on itsown. Indeed, as shown in Figure 9 (upper right), this pre-training regime leads to well-trainable sub-networks, though networks must be trained for many more epochs compared to supervised training(40 compared to 1.25, or a factor of 32×). This result suggests that the labels for the ultimate taskthemselves are not necessary to put the network in such a state (although explicitly misleading labelsare detrimental). We emphasize that the duration of the pre-training phase required is an order ofmagnitude larger than the original rewinding iteration, however, suggesting that labels add importantinformation which accelerates the learning process.
6.3 BLURRING TRAINING EXAMPLES
To probe the importance of p(x) for the early phase of training, we study the extent to which thetraining input distribution is necessary. Namely, we pretrain using blurred training inputs with thecorrect labels. Following Achille et al. (2019), we blur training inputs by downsampling by 4xand then upsampling back to the full size. Figure 9 (bottom left) shows that this pre-training methodsucceeds: after 40 epochs of pre-training, IMP with rewinding can find sub-networks that are similarin performance to those found after training on the original task for 500 iterations (1.25 epochs).
Due to the success of the the rotation and blurring pre-training tasks, we explored the effect ofcombining these pre-training techniques. Doing so tests the extent to which we can discard both thetraining labels and some information from the training inputs. Figure 9 (bottom right) shows thatdoing so provides the network too little information: no amount of pre-training we considered makesit possible for IMP with rewinding to find sub-networks that perform tangibly better than rewindingto iteration 0. Interestingly however, as shown in Appendix B, trainable sub-networks are found forVGG-13 with this pre-training regime, suggesting that different network architectures have differentsensitivities to the deprivation of labels and input content.
6.4 SPARSE PRETRAINING
Since sparse sub-networks are often challenging to train from scratch without the proper initializa-tion (Han et al., 2015; Liu et al., 2019; Frankle & Carbin, 2019), does pre-training make it easierfor sparse neural networks to learn? Doing so would serve as a rough form of curriculum learning(Bengio et al., 2009) for sparse neural networks. We experimented with training sparse sub-networksof ResNet-20 (IMP sub-networks, randomly reinitialized sub-networks, and randomly pruned sub-networks) first on self-supervised rotation and then on the main task, but found no benefit beyondrewinding to iteration 0 (Figure 10). Moreover, doing so when starting from a sub-network rewoundto iteration 500 actually hurts final accuracy. This result suggests that while pre-training is sufficientto approximate the early phase of supervised training with an appropriately structured mask, it is notsufficient to do so with an inappropriate mask.
8
-
Published as a conference paper at ICLR 2020
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
82
84
86
88
90
92
Eval
Accu
racy
(%)
Resnet-20 (CIFAR-10) - Sparse Pretrain
Rewind to 0Rewind to 500PretrainPretrain, Random ReinitPretrain, Random Reinit, Randomize Masks
Figure 10: The effect of pretraining sparse sub-networks of Resnet-20 (rewound to iteration 500)with 40 epochs of self-supervised rotation before training on CIFAR-10.
7 DISCUSSION
In this paper, we first performed extensive measurements of various statistics summarizing learningover the early part of training. Notably, we uncovered 3 sub-phases: in the very first iterations, gra-dient magnitudes are anomalously large and motion is rapid. Subsequently, gradients overshoot tosmaller magnitudes before leveling off while performance increases rapidly. Then, learning slowlybegins to decelerate. We then studied a suite of perturbations to the network state in the early phasefinding that, counter to observations in smaller networks (Zhou et al., 2019), deeper networks are notrobust to reinitializing with random weights with maintained signs. We also found that the weightdistribution after the early phase of training is highly non-independent. Finally, we measured thedata-dependence of the early phase with the surprising result that pre-training on a self-supervisedtask yields equivalent performance to late rewinding with IMP.
These results have significant implications for the lottery ticket hypothesis. The seeming necessityof late rewinding calls into question certain interpretations of lottery tickets as well as the ability toidentify sub-networks at initialization. Our observation that weights are highly non-independent atthe rewinding point suggests that the weights at this point cannot be easily approximated, makingapproaches which attempt to “jump” directly to the rewinding point unlikely to succeed. However,our result that labels are not necessary to approximate the rewinding point suggests that the learningduring this phase does not require task-specific information, suggesting that rewinding may not benecessary if networks are pre-trained appropriately.
REFERENCESAlessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural
networks. 2019.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM,2009.
Pratik Chaudhuri and Stefano Soatto. Stochastic gradient descent performs variational inference,converges to limit cycles for deep networks. 2017. URL https://arxiv.org/abs/1710.11029.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, andSamy Bengio. Why does unsupervised pre-training help deep learning? Journal of MachineLearning Research, 11(Feb):625–660, 2010.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neuralnetworks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing thelottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.
9
https://arxiv.org/abs/1710.11029https://arxiv.org/abs/1710.11029https://openreview.net/forum?id=rJl-b3RcF7https://openreview.net/forum?id=rJl-b3RcF7
-
Published as a conference paper at ICLR 2020
Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimizationvia hessian eigenvalue density. In Proceedings of the 36th International Conference on MachineLearning, volume 97, pp. 2232–2241, 2019. URL http://proceedings.mlr.press/v97/ghorbani19b.html.
Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning bypredicting image rotations. In International Conference on Learning Representations, 2018. URLhttps://openreview.net/forum?id=S1v4N2l0-.
Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep net-works: Weight decay and data augmentation affect early learning dynamics, matter little nearconvergence. 2019.
Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXivpreprint arXiv:1812.04754, 2018.
Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections forefficient neural network. In Advances in neural information processing systems, pp. 1135–1143,2015.
K He, X Zhang, S Ren, and J Sun. Deep residual learning for image recognition. In ComputerVision and Pattern Recogntion (CVPR), volume 5, pp. 6, 2015.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training byreducing internal covariate shift. In Proceedings of the 32Nd International Conference on Inter-national Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. JMLR.org, 2015.URL http://dl.acm.org/citation.cfm?id=3045118.3045167.
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value ofnetwork pruning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJlnB3C5Ym.
Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towardsunderstanding the role of over-parametrization in generalization of neural networks. 2019.
Levent Sagun, Utku Evci, Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis ofthe hessian of over-parametrized neural networks. 2017. URL https://arxiv.org/abs/1706.04454.
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normal-ization help optimization? In Advances in neural information processing systems, 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale imagerecognition. 2015.
Mingwei Wei and David Schwab. How noise during training affects the hessian spectrum in over-parameterized neural networks. 2019.
Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. 2019.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprintarXiv:1605.07146, 2016.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understandingdeep learning requires rethinking generalization. 2017.
Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros,signs, and the supermask. 2019.
10
http://proceedings.mlr.press/v97/ghorbani19b.htmlhttp://proceedings.mlr.press/v97/ghorbani19b.htmlhttps://openreview.net/forum?id=S1v4N2l0-http://dl.acm.org/citation.cfm?id=3045118.3045167https://openreview.net/forum?id=rJlnB3C5Ymhttps://openreview.net/forum?id=rJlnB3C5Ymhttps://arxiv.org/abs/1706.04454https://arxiv.org/abs/1706.04454
-
Published as a conference paper at ICLR 2020
A MODEL DETAILS
Network Epochs Batch Size Learning Rate Parameters Eval Accuracy
ResNet-20 160 128 0.1 (Mom 0.9) 272K 91.5 ± 0.2%ResNet-56 160 128 0.1 (Mom 0.9) 856K 93.0 ± 0.1%ResNet-18 160 128 0.1 (Mom 0.9) 11.2M 86.8 ± 0.3%VGG-13 160 64 0.1 (Mom 0.9) 9.42M 93.5 ± 0.1%
WRN-16-8 160 128 0.1 (Mom 0.9) 11.1M 94.8 ± 0.1%
Table A1: Summary of the networks we study in this paper. We present ResNet-20 in the main bodyof the paper and the remaining networks in Appendix B.
Table A1 summarizes the networks. All networks follow the same training regime: we train withSGD for 160 epochs starting at learning rate 0.1 (momentum 0.9) and drop the learning rate by afactor of ten at epoch 80 and again at epoch 120. Training includes weight decay with weight 1e-4. Data is augmented with normalization, random flips, and random crops up to four pixels in anydirection.
B EXPERIMENTS FOR OTHER NETWORKS
Training iterations
1-10 10-500 500-2000 2000+
Gradient magnitudes are very large
Rapid motion in weight space
Large sign changes (6% in �rst 10 it) Gradients converge to roughly constant magnitude
Weight magnitudes decrease linearly
Weight magnitudes decrease linearly
Performance increase decelerates, but is still rapidRewinding begins to become highly e�ective
Rewinding begins to become e�ective(~250 iterations)
Hessian eigenspectrum separates(700 iterations; Gur-Ari et al., 2018)
Gradient magnitudes remain roughly constantCritical periods end (~8000 iterations; Achille et al., 2019; Golatkar et al., 2019)Performance slowly asymptotes
Bene�t of rewinding saturates
Gradient magnitudes reach a minimum before stabilizing 10% higher at iteration 250 Motion in weight space slows, but is still rapidPerformance increases rapidly
Figure A1: Rough timeline of the early phase of training for ResNet-18 on CIFAR-10, includingresults from previous papers.
11
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10)
Rewind to 0Rewind to 10Rewind to 50Rewind to 100Rewind to 250Rewind to 500Rewind to 1000
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
8082848688909294
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10)
Rewind to 0Rewind to 100Rewind to 250Rewind to 500Rewind to 1000Rewind to 2000
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
8182838485868788
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10)
Rewind to 0Rewind to 100Rewind to 250Rewind to 500Rewind to 1000Rewind to 2000
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
90
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10)
Rewind to 0Rewind to 10Rewind to 25Rewind to 50Rewind to 100Rewind to 250Rewind to 500
Figure A2: The effect of IMP rewinding iteration on the accuracy of sub-networks at various levelsof sparsity. Accompanies Figure 1.
12
-
Published as a conference paper at ICLR 2020
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.2
0.4
0.6
0.8
Eval
Accu
racy
(%)
VGG-13 - Eval Accuracy
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.5
1.0
1.5
2.0
Eval
Loss
VGG-13 - Eval Loss
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.0165
0.0170
0.0175
0.0180
0.0185
0.0190
0.0195
Mag
nitu
de
VGG-13 - Average Weight Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0
2
4
6
8
10
12
% o
f Sig
ns th
at Di
ffer f
rom
Init
VGG-13 - Sign Differences
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.030.020.010.000.010.020.03
Mag
nitu
de
VGG-13 - Weight Trace
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.20
0.25
0.30
0.35
0.40
0.45
0.50
Mag
nitu
de
VGG-13 - Gradient Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0
10
20
30
40
50
L2 D
istan
ce
VGG-13 - L2 Dist from Init
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
80
85
90
95
L2 D
istan
ce
VGG-13 - L2 Dist from Final
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.75
0.80
0.85
0.90
0.95
1.00
Cosin
e Sim
ilarit
y
VGG-13 - Cosine Similarity To Init
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.20
0.25
0.30
0.35
0.40
Cosin
e Sim
ilarit
y
VGG-13 - Cosine Similarity to Final
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.2
0.4
0.6
0.8
Eval
Accu
racy
(%)
Resnet-56 - Eval Accuracy
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0
1
2
3
4
5
6
Eval
Loss
Resnet-56 - Eval Loss
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.050
0.051
0.052
0.053
0.054
0.055
0.056
Mag
nitu
deResnet-56 - Average Weight Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.02.55.07.5
10.012.515.017.5
% o
f Sig
ns th
at Di
ffer f
rom
Init
Resnet-56 - Sign Differences
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.10
0.05
0.00
0.05
0.10
0.15
Mag
nitu
de
Resnet-56 - Weight Trace
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.20.30.40.50.60.70.80.9
Mag
nitu
de
Resnet-56 - Gradient Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0
10
20
30
40
50
L2 D
istan
ce
Resnet-56 - L2 Dist from Init
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
65.067.570.072.575.077.580.082.5
L2 D
istan
ce
Resnet-56 - L2 Dist from Final
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.75
0.80
0.85
0.90
0.95
1.00
Cosin
e Sim
ilarit
y
Resnet-56 - Cosine Similarity To Init
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.25
0.30
0.35
0.40
0.45
0.50
Cosin
e Sim
ilarit
y
Resnet-56 - Cosine Similarity to Final
0 1000 2000 3000 4000Training Iteration
0.10.20.30.40.50.60.70.80.9
Eval
Accu
racy
(%)
Resnet-18 - Eval Accuracy
0 1000 2000 3000 4000Training Iteration
0.500.751.001.251.501.752.002.25
Eval
Loss
Resnet-18 - Eval Loss
0 1000 2000 3000 4000Training Iteration
0.016
0.017
0.018
0.019
0.020
0.021
0.022
Mag
nitu
de
Resnet-18 - Average Weight Magnitude
0 1000 2000 3000 4000Training Iteration
012345678
% o
f Sig
ns th
at Di
ffer f
rom
Init Resnet-18 - Sign Differences
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.04
0.02
0.00
0.02
0.04
0.06
Mag
nitu
de
Resnet-18 - Weight Trace
0 1000 2000 3000 4000Training Iteration
0.5
1.0
1.5
2.0
Mag
nitu
de
Resnet-18 - Gradient Magnitude
0 1000 2000 3000 4000Training Iteration
020406080
100120
L2 D
istan
ce
Resnet-18 - L2 Dist
Init Final
0 1000 2000 3000 4000Training Iteration
0.20.30.40.50.60.70.80.91.0
Cosin
e Sim
ilarit
y
Resnet-18 - Cosine Similarity
Init Final
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.2
0.4
0.6
0.8
Eval
Accu
racy
(%)
WRN-16-8 - Eval Accuracy
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.5
1.0
1.5
2.0
Eval
Loss
WRN-16-8 - Eval Loss
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.01700.01750.01800.01850.01900.01950.02000.02050.0210
Mag
nitu
de
WRN-16-8 - Average Weight Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
02468
101214
% o
f Sig
ns th
at Di
ffer f
rom
Init
WRN-16-8 - Sign Differences
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.08
0.06
0.04
0.02
0.00
0.02
Mag
nitu
de
WRN-16-8 - Weight Trace
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.200.250.300.350.400.450.500.550.60
Mag
nitu
de
WRN-16-8 - Gradient Magnitude
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0
10
20
30
40
50
60
L2 D
istan
ce
WRN-16-8 - L2 Dist from Init
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
80
85
90
95
100
105
L2 D
istan
ce
WRN-16-8 - L2 Dist from Final
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.8250.8500.8750.9000.9250.9500.9751.000
Cosin
e Sim
ilarit
y
WRN-16-8 - Cosine Similarity To Init
0 500 1000 1500 2000 2500 3000 3500 4000Training Iteration
0.20
0.25
0.30
0.35
0.40
0.45
Cosin
e Sim
ilarit
y
WRN-16-8 - Cosine Similarity to Final
Figure A3: Basic telemetry about the state of all networks in Table A1 during the first 4000 iterationsof training. Accompanies Figure 3.
13
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10)
init: 0 signs: 0init: 250 signs: 250random init signs: 250init: 0 signs: 250init: 250 signs: 0
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10)
init: 0 signs: 0init: 500 signs: 500random init signs: 500init: 0 signs: 500init: 500 signs: 0
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80828486889092
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10)
init: 0 signs: 0init: 500 signs: 500random init signs: 500init: 0 signs: 500init: 500 signs: 0
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
8082848688909294
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10)
init: 0 signs: 0init: 2000 signs: 2000random init signs: 2000init: 0 signs: 2000init: 2000 signs: 0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
8182838485868788
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10)
init: 0 signs: 0init: 500 signs: 500random init signs: 500init: 0 signs: 500init: 500 signs: 0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10)
init: 0 signs: 0init: 2000 signs: 2000random init signs: 2000init: 0 signs: 2000init: 2000 signs: 0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
8889909192939495
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10)
init: 0 signs: 0init: 50 signs: 50random init signs: 50init: 0 signs: 50init: 50 signs: 0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
8889909192939495
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10)
init: 0 signs: 0init: 250 signs: 250random init signs: 250init: 0 signs: 250init: 250 signs: 0
Figure A4: The effect of training an IMP-derived sub-network initialized to the signs at iteration 0or k and the magnitudes at iteration 0 or k. Accompanies Figure A4.
14
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - Rewinding Iteration 250
Rewind to 0Rewind to 250Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 500Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 500Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - Rewinding Iteration 2000
Rewind to 0Rewind to 2000Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
81
82
83
84
85
86
87
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 500Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Rewinding Iteration 2000
Rewind to 0Rewind to 2000Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Rewinding Iteration 50
Rewind to 0Rewind to 50Shuffle GloballyShuffle LayersShuffle Filters
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Rewinding Iteration 250
Rewind to 0Rewind to 250Shuffle GloballyShuffle LayersShuffle Filters
Figure A5: The effect of training an IMP-derived sub-network initialized with the weights at itera-tion k as shuffled within various structural elements. Accompanies Figure 5.
15
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10)
Rewind to 0Rewind to 250Shuffle Globally, Signs at 250Shuffle Layers, Signs at 250Shuffle Filters, Signs at 250Shuffle Layers, Signs at Init
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10)
Rewind to 0Rewind to 500Shuffle Globally, Signs at 500Shuffle Layers, Signs at 500Shuffle Filters, Signs at 500Shuffle Layers, Signs at Init
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10)
Rewind to 0Rewind to 500Shuffle Globally, Signs at 500Shuffle Layers, Signs at 500Shuffle Filters, Signs at 500Shuffle Layers, Signs at Init
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10)
Rewind to 0Rewind to 2000Shuffle Globally, Signs at 2000Shuffle Layers, Signs at 2000Shuffle Filters, Signs at 2000Shuffle Layers, Signs at Init
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
81
82
83
84
85
86
87
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 500Shuffle Globally, Signs at 500Shuffle Layers, Signs at 500Shuffle Filters, Signs at 500Shuffle Layers, Signs at Init
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Rewinding Iteration 2000
Rewind to 0Rewind to 2000Shuffle Globally, Signs at 2000Shuffle Layers, Signs at 2000Shuffle Filters, Signs at 2000Shuffle Layers, Signs at Init
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Rewinding Iteration 50
Rewind to 0Rewind to 50Shuffle Globally, Signs at 50Shuffle Layers, Signs at 50Shuffle Filters, Signs at 50Shuffle Layers, Signs at Init
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Rewinding Iteration 250
Rewind to 0Rewind to 250Shuffle Globally, Signs at 250Shuffle Layers, Signs at 250Shuffle Filters, Signs at 250Shuffle Layers, Signs at Init
Figure A6: The effect of training an IMP-derived sub-network initialized with the weights at iter-ation k as shuffled within various structural elements where shuffling only occurs between weightswith the same sign. Accompanies Figure 6.
16
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - Rewinding Iteration 250
Rewind to 0Rewind to 2500.51.02.03.0
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 5000.51.02.03.0
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
87
88
89
90
91
92
93
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 5000.51.02.03.0
100.0 64.0 41.0 26.2 16.8 10.7 6.9 4.4Percent of Weights Remaining
87
88
89
90
91
92
93
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - Rewinding Iteration 2000
Rewind to 0Rewind to 20000.51.02.03.0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
81
82
83
84
85
86
87
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Rewinding Iteration 500
Rewind to 0Rewind to 5000.51.02.03.0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Rewinding Iteration 2000
Rewind to 0Rewind to 20000.51.02.03.0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Rewinding Iteration 50
Rewind to 0Rewind to 500.51.02.03.0
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Rewinding Iteration 250
Rewind to 0Rewind to 2500.51.02.03.0
Figure A7: The effect of training an IMP-derived sub-network initialized with the weights at itera-tion k and Gaussian noise of nσ, where σ is the standard deviation of the initialization distributionfor each layer. Accompanies Figure 7.
17
-
Published as a conference paper at ICLR 2020
0.000 0.025 0.050 0.075 0.100 0.125 0.150 0.175Stddev of Effective Noise vs. Rewind to 250
89.0
89.5
90.0
90.5
91.0
91.5
Eval
Accu
racy
Rewind to 250
Gaussian 0.5
Gaussian 1.0
Gaussian 2.0
Gaussian 3.0
random init signs: 250
init: 0 signs: 250
init: 250 signs: 0
Shuffle GloballyShuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 250
Shuffle Layers, Signs at 250
Shuffle Filters, Signs at 250
VGG-13 - Rewind to 250 - 0.8% Sparsity
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18Stddev of Effective Noise vs. Rewind to 500
88.5
89.0
89.5
90.0
90.5
91.0
91.5
92.0
92.5
Eval
Accu
racy
Rewind to 500
Gaussian 0.5
Gaussian 1.0
Gaussian 2.0
Gaussian 3.0
random init signs: 500
init: 0 signs: 500
init: 500 signs: 0
Shuffle GloballyShuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 500
Shuffle Layers, Signs at 500
Shuffle Filters, Signs at 500
VGG-13 - Rewind to 500 - 0.8% Sparsity
0.00 0.05 0.10 0.15 0.20 0.25Stddev of Effective Noise vs. Rewind to 500
90.75
91.00
91.25
91.50
91.75
92.00
92.25
92.50
92.75
Eval
Accu
racy
Rewind to 500
Gaussian 0.5
Gaussian 1.0
Gaussian 2.0
Gaussian 3.0
random init signs: 500init: 0 signs: 500
init: 500 signs: 0
Shuffle Globally Shuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 500Shuffle Layers, Signs at 500
Shuffle Filters, Signs at 500
Resnet-56 - Rewind to 500 - 16.8% Sparsity
0.00 0.05 0.10 0.15 0.20 0.25Stddev of Effective Noise vs. Rewind to 2000
91.0
91.5
92.0
92.5
93.0
93.5
Eval
Accu
racy
Rewind to 2000Gaussian 0.5 Gaussian 1.0
Gaussian 2.0
Gaussian 3.0random init signs: 2000init: 0 signs: 2000
init: 2000 signs: 0
Shuffle Globally
Shuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 2000
Shuffle Layers, Signs at 2000
Shuffle Filters, Signs at 2000
Resnet-56 - Rewind to 2000 - 16.8% Sparsity
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14Stddev of Effective Noise vs. Rewind to 500
86.2
86.4
86.6
86.8
87.0
Eval
Accu
racy
Rewind to 500
Gaussian 0.5
Gaussian 1.0
Gaussian 2.0Gaussian 3.0
random init signs: 500
init: 0 signs: 500 init: 500 signs: 0
Shuffle Globally
Shuffle Filters
Shuffle Globally, Signs at 500
Shuffle Layers, Signs at 500
Shuffle Filters, Signs at 500
Resnet-18 - Rewind to 500 - 3.5% Sparsity
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14Stddev of Effective Noise vs. Rewind to 2000
86.4
86.6
86.8
87.0
87.2
87.4
87.6
87.8
Eval
Accu
racy
Rewind to 2000Gaussian 0.5
Gaussian 1.0
Gaussian 2.0
Gaussian 3.0random init signs: 2000init: 0 signs: 2000init: 2000 signs: 0
Shuffle Globally
Shuffle LayersShuffle Filters
Shuffle Globally, Signs at 2000
Resnet-18 - Rewind to 2000 - 3.5% Sparsity
0.00 0.02 0.04 0.06 0.08 0.10 0.12Stddev of Effective Noise vs. Rewind to 50
93.6
93.8
94.0
94.2
94.4
94.6
94.8
Eval
Accu
racy
Rewind to 50
Gaussian 0.5
Gaussian 1.0
Gaussian 2.0 Gaussian 3.0
random init signs: 50
init: 0 signs: 50
init: 50 signs: 0
Shuffle GloballyShuffle Layers
Shuffle Filters
Shuffle Globally, Signs at 50
Shuffle Filters, Signs at 50
WRN-16-8 - Rewind to 50 - 6.9% Sparsity
0.00 0.02 0.04 0.06 0.08 0.10 0.12Stddev of Effective Noise vs. Rewind to 250
93.6
93.8
94.0
94.2
94.4
94.6
94.8
95.0
95.2
Eval
Accu
racy
Rewind to 250
Gaussian 0.5
Gaussian 1.0
Gaussian 2.0
Gaussian 3.0
random init signs: 250
init: 0 signs: 250
init: 250 signs: 0
WRN-16-8 - Rewind to 250 - 6.9% Sparsity
Figure A8: The effective standard deviation of each of the perturbations studied in Section 5 as afunction of mean evaluation accuracy (across five seeds). Accompanies Figure 8.
18
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
84
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - Random Labels
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
20
40
60
80
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - Random Labels
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Random Labels
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Random Labels
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
Figure A9: The effect of pre-training CIFAR-10 with random labels. Accompanies Figure 9.
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - Self-Supervised Rotation
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - Self-Supervised Rotation
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - Self-Supervised Rotation
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
90
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - Self-Supervised Rotation
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
Figure A10: The effect of pre-training CIFAR-10 with self-supervised rotation. Accompanies Figure9.
19
-
Published as a conference paper at ICLR 2020
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - 4x Blurring
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - 4x Blurring
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - 4x Blurring
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - 4x Blurring
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
Figure A11: The effect of pre-training CIFAR-10 with 4x blurring. Accompanies Figure 9.
100.0 51.2 26.3 13.5 6.9 3.6 1.9 1.0 0.5Percent of Weights Remaining
86
88
90
92
94
Eval
Accu
racy
(%)
VGG-13 (CIFAR-10) - 4x Blurring + Self-Superivsed Rotation
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8Percent of Weights Remaining
80
82
84
86
88
90
92
94
Eval
Accu
racy
(%)
Resnet-56 (CIFAR-10) - 4x Blurring + Self-Superivsed Rotation
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9 0.5Percent of Weights Remaining
82
84
86
88
Eval
Accu
racy
(%)
Resnet-18 (CIFAR-10) - 4x Blurring + Self-Superivsed Rotation
Rewind to 0Rewind to 500Rewind to 2000Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
100.0 51.2 26.2 13.4 6.9 3.5 1.8 0.9Percent of Weights Remaining
91
92
93
94
95
Eval
Accu
racy
(%)
WRN-16-8 (CIFAR-10) - 4x Blurring + Self-Superivsed Rotation
Rewind to 0Rewind to 250Rewind to 500Pretrain 2 EpochsPretrain 5 EpochsPretrain 10 EpochsPretrain 40 Epochs
Figure A12: The effect of pre-training CIFAR-10 with 4x blurring and self-supervised rotation.Accompanies Figure 9.
20
IntroductionKnown Phenomena in the Early Phase of TrainingPreliminaries and MethodologyThe State of the Network Early in TrainingPerturbing Neural Networks Early in TrainingAre Signs All You Need?Are Weight Distributions i.i.d.?Is it all just noise?
The Data-Dependence of Neural Networks Early in TrainingRandom labelsSelf-supervised rotation predictionBlurring training examplesSparse Pretraining
DiscussionModel detailsExperiments for Other Networks