
Published as a conference paper at ICLR 2020

    THE EARLY PHASE OF NEURAL NETWORK TRAINING

Jonathan Frankle† (MIT CSAIL)

David J. Schwab (CUNY ITS, Facebook AI Research)

Ari S. Morcos (Facebook AI Research)

    ABSTRACT

Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this behavior, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are not inherently label-dependent, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.

    1 INTRODUCTION

Over the past decade, methods for successfully training big, deep neural networks have revolutionized machine learning. Yet surprisingly, the underlying reasons for the success of these approaches remain poorly understood, despite remarkable empirical performance (Santurkar et al., 2018; Zhang et al., 2017). A large body of work has focused on understanding what happens during the later stages of training (Neyshabur et al., 2019; Yaida, 2019; Chaudhari & Soatto, 2017; Wei & Schwab, 2019), while the initial phase has been less explored. However, a number of distinct observations indicate that significant and consequential changes occur during the earliest stage of training. These include the presence of critical periods during training (Achille et al., 2019), the dramatic reshaping of the local loss landscape (Sagun et al., 2017; Gur-Ari et al., 2018), and the necessity of rewinding in the context of the lottery ticket hypothesis (Frankle et al., 2019). Here we perform a thorough investigation of the state of the network in this early stage.

To provide a unified framework for understanding the changes the network undergoes during the early phase, we employ the methodology of iterative magnitude pruning with rewinding (IMP), as detailed below, throughout the bulk of this work (Frankle & Carbin, 2019; Frankle et al., 2019). The initial lottery ticket hypothesis, which was validated on comparatively small networks, proposed that small, sparse sub-networks found via pruning of converged larger models could be trained to high performance provided they were initialized with the same values used in the training of the unpruned model (Frankle & Carbin, 2019). However, follow-up work found that rewinding the weights to their values at some iteration early in the training of the unpruned model, rather than to their initial values, was necessary to achieve good performance on deeper networks such as ResNets (Frankle et al., 2019). This observation suggests that the changes in the network during this initial phase are vital for the success of the training of small, sparse sub-networks. As a result, this paradigm provides a simple and quantitative scheme for measuring the importance of the weights at various points early in training within an actionable and causal framework.

    †Work done while an intern at Facebook AI Research.


    We make the following contributions, all evaluated across three different network architectures:

1. We provide an in-depth overview of various statistics summarizing learning over the early part of training.

2. We evaluate the impact of perturbing the state of the network in various ways during the early phase of training, finding that:

(i) counter to observations in smaller networks (Zhou et al., 2019), deeper networks are not robust to reinitialization with random weights while maintaining signs;

(ii) the distribution of weights after the early phase of training is already highly non-i.i.d., as permuting them dramatically harms performance, even when signs are maintained;

(iii) both of the above perturbations can roughly be approximated by simply adding noise to the network weights, though this effect is stronger for (ii) than (i).

3. We measure the data-dependence of the early phase of training, finding that pre-training using only p(x) can approximate the changes that occur in the early phase of training, though pre-training must last for far longer (∼32× longer) and not be fed misleading labels.

    2 KNOWN PHENOMENA IN THE EARLY PHASE OF TRAINING

Figure 1: Accuracy of IMP when rewinding to various iterations of the early phase for ResNet-20 (CIFAR-10) sub-networks as a function of sparsity level (percent of weights remaining), for rewinding to iterations 0, 100, 250, 500, 1000, and 2000.

Lottery ticket rewinding: The original lottery ticket paper (Frankle & Carbin, 2019) rewound weights to initialization, i.e., k = 0, during IMP. Follow-up work on larger models demonstrated that it is necessary to rewind to a later point during training, i.e., k > 0, for IMP to succeed (Frankle et al., 2019).

Figure 2: Rough timeline of the early phase of training for ResNet-20 on CIFAR-10.
Iterations 1-10: gradient magnitudes are very large; rapid motion in weight space; large sign changes (3% in the first 10 iterations).
Iterations 10-500: gradient magnitudes reach a minimum before stabilizing 10% higher; motion in weight space slows, but is still rapid; performance increases rapidly, reaching over 50% eval accuracy.
Iterations 500-2000: gradients converge to roughly constant magnitude; weight magnitudes increase linearly; performance increase decelerates, but is still rapid; rewinding begins to become highly effective.
Iterations 2000+: gradient magnitudes remain roughly constant; weight magnitude increase slows; performance slowly asymptotes; benefit of rewinding saturates.

3 PRELIMINARIES AND METHODOLOGY

Networks: We study three architecture families on CIFAR-10: ResNets (He et al., 2015), the WRN-16-8 wide residual network (Zagoruyko & Komodakis, 2016), and the VGG-13 network (Simonyan & Zisserman (2015) as adapted by Liu et al. (2019)). Throughout the main body of the paper, we show ResNet-20; in Appendix B, we present the same experiments for the other networks. Unless otherwise stated, results were qualitatively similar across all three networks. All experiments in this paper display the mean and standard deviation across five replicates with different random seeds. See Appendix A for further model details.

Iterative magnitude pruning with rewinding: In order to test the effect of various hypotheses about the state of sparse networks early in training, we use the iterative magnitude pruning with rewinding (IMP) procedure of Frankle et al. (2019) to extract sub-networks from various points in training that could have learned on their own. The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network, and rewinding the remaining weights to their values from an earlier iteration k during the initial, pre-pruning training run. This process is iterated to produce networks with high sparsity levels. As demonstrated in Frankle et al. (2019), IMP with rewinding leads to sparse sub-networks which can train to high performance even at high sparsity levels (> 90%).
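For concreteness, the sketch below illustrates this prune-and-rewind loop on a network represented as a dict of numpy weight arrays. The train_fn callable, the dict-of-arrays representation, and the function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def imp_with_rewinding(train_fn, init_weights, k, prune_fraction=0.2, rounds=5):
    """Minimal sketch of iterative magnitude pruning (IMP) with rewinding.

    train_fn(weights, mask) -> list of weight dicts, one snapshot per training
    iteration, with the mask applied throughout training (an assumed interface).
    init_weights: dict name -> np.ndarray. k: rewinding iteration.
    """
    mask = {name: np.ones_like(w) for name, w in init_weights.items()}

    # First (unpruned) run: record the weights at the rewinding iteration k.
    snapshots = train_fn(init_weights, mask)
    rewind_point, final = snapshots[k], snapshots[-1]

    rewound = init_weights
    for _ in range(rounds):
        # Prune the 20% of surviving weights with the lowest magnitudes, globally.
        alive = np.concatenate([np.abs(final[n][mask[n] == 1]) for n in final])
        threshold = np.quantile(alive, prune_fraction)
        mask = {n: np.where(np.abs(final[n]) < threshold, 0.0, mask[n]) for n in mask}

        # Rewind the remaining weights to their values at iteration k and retrain.
        rewound = {n: rewind_point[n] * mask[n] for n in rewind_point}
        final = train_fn(rewound, mask)[-1]
    return mask, rewound
```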

Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 sub-networks at increasing sparsity for several rewinding values of k. For k ≥ 500, sub-networks can match the performance of the original network with 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).

    4 THE STATE OF THE NETWORK EARLY IN TRAINING

Many of the aforementioned papers refer to various points in the “early” part of training. In this section, we descriptively chart the state of ResNet-20 during the earliest phase of training to provide context for this related work and our subsequent experiments. We specifically focus on the first 4,000 iterations (10 epochs). See Figure A3 for the characterization of additional networks. We include a summary of these results for ResNet-20 as a timeline in Figure 2, and include a broader timeline including results from several previous papers for ResNet-18 in Figure A1.

As shown in Figure 3, during the earliest ten iterations, the network undergoes substantial change. It experiences large gradients that correspond to a rapid increase in distance from the initialization and a large number of sign changes of the weights. After these initial iterations, gradient magnitudes drop and the rate of change in each of the aforementioned quantities gradually slows through the remainder of the period we observe. Interestingly, gradient magnitudes reach a minimum after the first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to the final 91.5%. By 2000 iterations, accuracy approaches 80%.
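The statistics plotted in Figure 3 are straightforward to compute from weight and gradient snapshots; a minimal sketch is below, assuming the network's weights and gradients have been flattened into numpy arrays (an illustrative reconstruction, not the authors' logging code).

```python
import numpy as np

def telemetry(w_t, w_init, w_final, grad_t):
    """Per-iteration statistics of the kind shown in Figure 3, computed from
    flattened weight vectors at iteration t, at initialization, and at the end
    of training, plus the gradient at iteration t."""
    cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return {
        "avg_weight_magnitude": np.mean(np.abs(w_t)),
        "pct_signs_differ_from_init": 100.0 * np.mean(np.sign(w_t) != np.sign(w_init)),
        "gradient_magnitude": np.mean(np.abs(grad_t)),
        "l2_dist_from_init": np.linalg.norm(w_t - w_init),
        "l2_dist_from_final": np.linalg.norm(w_t - w_final),
        "cosine_similarity_to_init": cos(w_t, w_init),
        "cosine_similarity_to_final": cos(w_t, w_final),
    }
```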

During the first 4000 iterations of training, we observe three sub-phases. In the first phase, lasting only the initial few iterations, gradient magnitudes are very large and, consequently, the network changes rapidly. In the second phase, lasting about 500 iterations, performance quickly improves, weight magnitudes quickly increase, sign differences from initialization quickly increase, and gradient magnitudes reach a minimum before settling at a stable level. Finally, in the third phase, all of these quantities continue to change in the same direction, but begin to decelerate.

Figure 3: Basic telemetry about the state of ResNet-20 during the first 4000 iterations (10 epochs). Top row: evaluation accuracy and loss; average weight magnitude; percentage of weights that change sign from initialization; the values of ten randomly-selected weights. Bottom row: gradient magnitude; L2 distance of the weights from their initial and final values; cosine similarity of the weights to their initial and final values.

Figure 4: Performance of an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized to the signs at iteration 0 or k and the magnitudes at iteration 0 or k (including a random reinitialization paired with the signs at iteration k). Left: k = 500. Right: k = 2000.

    5 PERTURBING NEURAL NETWORKS EARLY IN TRAINING

Figure 1 shows that the changes in the network weights over the first 500 iterations of training are essential for high performance at high sparsity levels. What features of this weight transformation are necessary to recover increased performance? Can they be summarized by maintaining the weight signs, but discarding their magnitudes, as implied by Zhou et al. (2019)? Can they be represented distributionally? In this section, we evaluate these questions by perturbing the early state of the network in various ways. Concretely, we either add noise or shuffle the weights of IMP sub-networks of ResNet-20 across different network sub-components and examine the effect on the network’s ability to learn thereafter. The sub-networks derived by IMP with rewinding make it possible to understand the causal impact of perturbations on sub-networks that are as capable as the full networks but decline more visibly in performance when improperly configured. To enable comparisons between the experiments in this section and provide a common frame of reference, we measure the effective standard deviation of each perturbation, i.e., stddev(w_perturb − w_orig).
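Concretely, the effective standard deviation is the standard deviation of the element-wise difference between the perturbed and original weights, pooled over the whole network; a short sketch (using the dict-of-arrays representation assumed earlier) is:

```python
import numpy as np

def effective_noise_std(w_perturb, w_orig):
    """stddev(w_perturb - w_orig), pooled over every weight in the network."""
    diff = np.concatenate([(w_perturb[n] - w_orig[n]).ravel() for n in w_orig])
    return diff.std()
```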

    5.1 ARE SIGNS ALL YOU NEED?

Zhou et al. (2019) show that, for a set of small convolutional networks, signs alone are sufficient to capture the state of lottery ticket sub-networks. However, it is unclear whether signs are still sufficient for larger networks early in training. In Figure 4, we investigate the impact of combining the magnitudes of the weights from one time-point with the signs from another. We found that the signs at iteration 500 paired with the magnitudes from initialization (red line) or from a separate random initialization (green line) were insufficient to maintain the performance reached by using both signs and magnitudes from iteration 500 (orange line), and performance drops to that of using both magnitudes and signs from initialization (blue line).


Figure 5: Performance of an IMP-derived ResNet-20 sub-network on CIFAR-10 initialized with the weights at iteration k permuted within various structural elements (globally, within layers, or within filters). Left: k = 500. Right: k = 2000.

Figure 6: The effect of training an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized with the weights at iteration k as shuffled within various structural elements, where shuffling only occurs between weights with the same sign. Left: k = 500. Right: k = 2000.

However, when using the magnitudes from iteration 500 and the signs from initialization, performance is still substantially better than using both the signs and magnitudes from initialization. In addition, the overall perturbation to the network from using the magnitudes at iteration 500 with the signs from initialization (mean: 0.0, stddev: 0.033) is smaller than from using the signs at iteration 500 with the magnitudes from initialization (0.0 ± 0.042, mean ± std). These results suggest that the change in weight magnitudes over the first 500 iterations of training is substantially more important than the change in signs for enabling subsequent training.

By iteration 2000, however, pairing the iteration 2000 signs with the magnitudes from initialization (red line) reaches similar performance to using the signs from initialization and the magnitudes from iteration 2000 (purple line), though not as high as using both from iteration 2000. This result suggests that network signs undergo important changes between iterations 500 and 2000, even though only 9% of signs change during this period. Our results also suggest that, counter to the observations of Zhou et al. (2019) in shallow networks, signs alone are not sufficient in deeper networks.
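The recombination experiment in Figure 4 amounts to taking each weight's sign from one checkpoint and its magnitude from another; a minimal sketch (again over a dict of numpy arrays, an assumed representation rather than the authors' code) is:

```python
import numpy as np

def combine_sign_and_magnitude(sign_source, magnitude_source):
    """Build weights whose signs come from one checkpoint (e.g., iteration 500)
    and whose magnitudes come from another (e.g., iteration 0)."""
    return {name: np.sign(sign_source[name]) * np.abs(magnitude_source[name])
            for name in sign_source}
```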

    5.2 ARE WEIGHT DISTRIBUTIONS I.I.D.?

Can the changes in weights over the first k iterations be approximated distributionally? To measure this, we permuted the weights at iteration k within various structural sub-components of the network (globally, within layers, and within convolutional filters). If networks are robust to these permutations, it would suggest that the weights in such sub-components might be approximated and sampled from. As Figure 5 shows, however, we found that performance was not robust to shuffling weights globally (green line) or within layers (red line), and drops substantially to no better than that of the original initialization (blue line) at both 500 and 2000 iterations.1 Shuffling within filters (purple line) performs slightly better, but results in a smaller overall perturbation (0.0 ± 0.092 for k = 500) than shuffling layerwise (0.0 ± 0.143) or globally (0.0 ± 0.144), suggesting that this change in perturbation strength may simply account for the difference.

1 We also considered shuffling within the incoming and outgoing weights for each neuron, but performance was equivalent to shuffling within layers. We elided these lines for readability.


Figure 7: The effect of training an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized with the weights at iteration k plus Gaussian noise of nσ, where σ is the standard deviation of the initialization distribution for each layer and n ∈ {0.5, 1, 2, 3}. Left: k = 500. Right: k = 2000.

Are the signs from the rewinding iteration, k, sufficient to recover the damage caused by permutation? In Figure 6, we also consider shuffling only amongst weights that have the same sign. Doing so substantially improves the performance of the filter-wise shuffle; however, it also reduces the extent of the overall perturbation (0.0 ± 0.049 for k = 500). It also improves the performance of shuffling within layers slightly for k = 500 and substantially for k = 2000. We attribute the behavior for k = 2000 to the signs, just as in Figure 4: when the magnitudes are similar in value (Figure 4, red line) or distribution (Figure 6, red and green lines), using the signs improves performance. Reverting to the initial signs while shuffling magnitudes within layers (brown line), however, damages the network too severely (0.0 ± 0.087 for k = 500) to yield any performance improvement over random noise. These results suggest that, while the signs from initialization are not sufficient for high performance at high sparsity as shown in Section 5.1, the signs from the rewinding iteration are sufficient to recover the damage caused by permutation, at least to some extent.
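The permutation experiments of Figures 5 and 6 can be expressed as one shuffling routine that permutes weights within a chosen scope (globally, per layer, or per convolutional filter), optionally only among weights that share a sign. The sketch below assumes a dict of numpy arrays with output filters along axis 0; it illustrates the idea and is not the authors' code.

```python
import numpy as np

def shuffle_weights(weights, scope="layer", match_signs=False, seed=0):
    """Permute weights within a chosen scope: "global", "layer", or "filter"
    (one group per output filter/unit, assumed to lie along axis 0). If
    match_signs is True, weights are only exchanged with same-sign weights."""
    rng = np.random.default_rng(seed)
    names = list(weights)
    flat = np.concatenate([weights[n].ravel() for n in names])

    # Assign every scalar weight a group id according to the chosen scope.
    group_ids = []
    for i, n in enumerate(names):
        w = weights[n]
        if scope == "global":
            g = np.zeros(w.size, dtype=np.int64)
        elif scope == "layer":
            g = np.full(w.size, i, dtype=np.int64)
        elif scope == "filter":
            g = np.repeat(np.arange(w.shape[0]), w[0].size) + i * 10**6
        else:
            raise ValueError(scope)
        group_ids.append(g)
    groups = np.concatenate(group_ids)
    if match_signs:
        groups = 2 * groups + (flat > 0)  # split every group by sign

    # Permute independently within each group.
    shuffled = flat.copy()
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        shuffled[idx] = rng.permutation(flat[idx])

    # Restore the original dict-of-arrays structure.
    out, start = {}, 0
    for n in names:
        size = weights[n].size
        out[n] = shuffled[start:start + size].reshape(weights[n].shape)
        start += size
    return out
```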

    5.3 IS IT ALL JUST NOISE?

Some of our previous results suggest that the impact of the sign and permutation perturbations may simply reduce to adding noise to the weights. To evaluate this hypothesis, we next study the effect of simply adding Gaussian noise to the network weights at iteration k. To add noise appropriately for layers with different scales, the standard deviation of the noise added to each layer was normalized to a multiple of the standard deviation σ of the initialization distribution for that layer. In Figure 7, we see that for iteration k = 500, sub-networks can tolerate 0.5σ to 1σ of noise before performance degrades back to that of the original initialization at higher levels of noise. For iteration k = 2000, networks are surprisingly robust to noise up to 1σ, and even 2σ exhibits nontrivial performance.
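A sketch of this perturbation, assuming the per-layer initialization standard deviations are supplied as an input, is:

```python
import numpy as np

def add_scaled_gaussian_noise(weights, init_std, n_sigma=1.0, seed=0):
    """Add Gaussian noise with standard deviation n_sigma times that layer's
    initialization stddev to every layer, as in the Figure 7 experiment.
    `init_std` maps layer name -> initialization stddev (an assumed input)."""
    rng = np.random.default_rng(seed)
    return {name: w + rng.normal(0.0, n_sigma * init_std[name], size=w.shape)
            for name, w in weights.items()}
```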

In Figure 8, we plot the performance of each network at a fixed sparsity level as a function of the effective standard deviation of the noise imposed by each of the aforementioned perturbations. We find that the standard deviation of the effective noise explains the resulting performance fairly well (k = 500: r = −0.672, p = 0.008; k = 2000: r = −0.726, p = 0.003). As expected, perturbations that preserved the performance of the network generally resulted in smaller changes to the state of the network at iteration k. Interestingly, experiments that mixed signs and magnitudes from different points in training (green points) aligned least well with this pattern: the standard deviation of the perturbation is roughly similar among all of these experiments, but the accuracy of the resulting networks changes substantially. This result suggests that, although the standard deviation of the noise is certainly indicative of lower accuracy, there are still specific perturbations that, while small in overall magnitude, can have a large effect on the network’s ability to learn, suggesting that the observed perturbation effects are not, in fact, just a consequence of noise.

    6 THE DATA-DEPENDENCE OF NEURAL NETWORKS EARLY IN TRAINING

Section 5 suggests that the change in network behavior by iteration k is not due to easily-ascertainable, distributional properties of the network weights and signs. Rather, it appears that training is required to reach these network states. It is unclear, however, to what extent various aspects of the data distribution are necessary. Specifically, is the change in weights during the early phase of training dependent on p(x) or p(y|x)? Here, we attempt to answer this question by measuring the extent to which we can re-create a favorable network state for sub-network training using restricted information from the training data and labels.


Figure 8: The effective standard deviation of various perturbations as a function of mean evaluation accuracy (across 5 seeds) at sparsity 26.2%. The mean of each perturbation was approximately 0. Left: k = 500, r = −0.672, p = 0.008. Right: k = 2000, r = −0.726, p = 0.003.

Figure 9: The effect of pre-training ResNet-20 on CIFAR-10 with random labels (upper left), self-supervised rotation (upper right), 4x blurring (bottom left), and 4x blurring plus self-supervised rotation (bottom right), for 2, 5, 10, or 40 epochs of pre-training, compared with rewinding to iterations 0, 500, and 2000.

In particular, we consider pre-training the network with techniques that ignore labels entirely (self-supervised rotation prediction, Section 6.2), provide misleading labels (training with random labels, Section 6.1), or eliminate information from the training examples (blurring, Section 6.3).

We first train a randomly-initialized, unpruned network on CIFAR-10 on the pre-training task for a set number of epochs. After pre-training, we train the network normally as if the pre-trained state were the original initialization. We then use the state of the network at the end of the pre-training phase as the “initialization” with which to find masks for IMP. Finally, we examine the performance of the IMP-pruned sub-networks as initialized with the state after pre-training. This experiment determines the extent to which pre-training places the network in a state suitable for sub-network training, as compared to using the state of the network at iteration k of training on the original task.
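In other words, the pre-trained weights simply take the place of the rewinding point in the IMP procedure. A minimal sketch of this protocol, reusing the assumed callables from the earlier IMP sketch, is:

```python
def pretrain_then_imp(init_weights, pretrain_fn, train_fn, imp_fn, pretrain_epochs):
    """Pre-train the unpruned network on a proxy task (random labels, rotation
    prediction, or blurred inputs), then hand the pre-trained weights to IMP as
    the state it rewinds to. All callables are assumptions for illustration."""
    w_pre = pretrain_fn(init_weights, epochs=pretrain_epochs)  # proxy-task pre-training
    return imp_fn(train_fn, init_weights=w_pre, k=0)           # IMP rewinds to the pre-trained state
```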


    6.1 RANDOM LABELS

To evaluate whether this phase of training is dependent on underlying structure in the data, we drew inspiration from Zhang et al. (2017) and pre-trained networks on data with randomized labels. This experiment tests whether the input distribution of the training data is sufficient to put the network in a position from which IMP with rewinding can find a sparse, trainable sub-network despite the presence of incorrect (not just missing) labels. Figure 9 (upper left) shows that pre-training on random labels for up to 10 epochs provides no improvement over rewinding to iteration 0, and that pre-training for longer begins to hurt accuracy. This result suggests that, though it is still possible that labels may not be required for learning, the presence of incorrect labels is sufficient to prevent learning which approximates the early phase of training.
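This pre-training task only changes the labels: each example keeps its image but is assigned a fixed random class, discarding p(y|x) while preserving p(x). A minimal sketch of the label randomization (illustrative, not the authors' data pipeline) is:

```python
import numpy as np

def randomize_labels(num_examples, num_classes=10, seed=0):
    """Assign every training example a fixed, uniformly random label."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=num_examples)
```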

    6.2 SELF-SUPERVISED ROTATION PREDICTION

What if we remove labels entirely? Is p(x) sufficient to approximate the early phase of training? Historically, neural network training often involved two steps: a self-supervised pre-training phase followed by a supervised phase on the target task (Erhan et al., 2010). Here, we consider one such self-supervised technique: rotation prediction (Gidaris et al., 2018). During the pre-training phase, the network is presented with a training image that has been randomly rotated by 90n degrees (where n ∈ {0, 1, 2, 3}), and the network must classify examples by the value of n. If self-supervised pre-training can approximate the early phase of training, it would suggest that p(x) is sufficient on its own. Indeed, as shown in Figure 9 (upper right), this pre-training regime leads to well-trainable sub-networks, though networks must be trained for many more epochs than with supervised training (40 compared to 1.25, a factor of 32×). This result suggests that the labels for the ultimate task are not themselves necessary to put the network in such a state (although explicitly misleading labels are detrimental). We emphasize, however, that the duration of the pre-training phase required is an order of magnitude larger than the original rewinding iteration, suggesting that labels add important information which accelerates the learning process.
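A sketch of how a rotation-prediction batch can be constructed, assuming images are numpy arrays of shape (batch, height, width, channels), is:

```python
import numpy as np

def rotation_batch(images, seed=0):
    """Rotate each image by 90*n degrees (n in {0, 1, 2, 3}) and return the
    rotated images together with n as the self-supervised label."""
    rng = np.random.default_rng(seed)
    n = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k, axes=(0, 1)) for img, k in zip(images, n)])
    return rotated, n
```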

    6.3 BLURRING TRAINING EXAMPLES

To probe the importance of p(x) for the early phase of training, we study the extent to which the training input distribution is necessary. Namely, we pre-train using blurred training inputs with the correct labels. Following Achille et al. (2019), we blur training inputs by downsampling by 4x and then upsampling back to the full size. Figure 9 (bottom left) shows that this pre-training method succeeds: after 40 epochs of pre-training, IMP with rewinding can find sub-networks that are similar in performance to those found after training on the original task for 500 iterations (1.25 epochs).
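A sketch of the blurring transform, assuming an image of shape (height, width, channels) with sides divisible by 4, is below; average pooling for the downsample and nearest-neighbour repetition for the upsample are assumptions, since the exact resampling filter is not specified here.

```python
import numpy as np

def blur_4x(image):
    """Downsample an image 4x and upsample it back to full size."""
    h, w, c = image.shape
    small = image.reshape(h // 4, 4, w // 4, 4, c).mean(axis=(1, 3))  # 4x average pool
    return np.repeat(np.repeat(small, 4, axis=0), 4, axis=1)          # nearest-neighbour upsample
```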

Due to the success of the rotation and blurring pre-training tasks, we explored the effect of combining these pre-training techniques. Doing so tests the extent to which we can discard both the training labels and some information from the training inputs. Figure 9 (bottom right) shows that doing so provides the network with too little information: no amount of pre-training we considered makes it possible for IMP with rewinding to find sub-networks that perform tangibly better than rewinding to iteration 0. Interestingly, however, as shown in Appendix B, trainable sub-networks are found for VGG-13 with this pre-training regime, suggesting that different network architectures have different sensitivities to the deprivation of labels and input content.

    6.4 SPARSE PRETRAINING

Since sparse sub-networks are often challenging to train from scratch without the proper initialization (Han et al., 2015; Liu et al., 2019; Frankle & Carbin, 2019), does pre-training make it easier for sparse neural networks to learn? Doing so would serve as a rough form of curriculum learning (Bengio et al., 2009) for sparse neural networks. We experimented with training sparse sub-networks of ResNet-20 (IMP sub-networks, randomly reinitialized sub-networks, and randomly pruned sub-networks) first on self-supervised rotation and then on the main task, but found no benefit beyond rewinding to iteration 0 (Figure 10). Moreover, doing so when starting from a sub-network rewound to iteration 500 actually hurts final accuracy. This result suggests that while pre-training is sufficient to approximate the early phase of supervised training with an appropriately structured mask, it is not sufficient to do so with an inappropriate mask.


Figure 10: The effect of pre-training sparse sub-networks of ResNet-20 (rewound to iteration 500) with 40 epochs of self-supervised rotation before training on CIFAR-10. Curves: rewind to 0; rewind to 500; pretrain; pretrain with random reinitialization; pretrain with random reinitialization and randomized masks.

    7 DISCUSSION

In this paper, we first performed extensive measurements of various statistics summarizing learning over the early part of training. Notably, we uncovered three sub-phases: in the very first iterations, gradient magnitudes are anomalously large and motion is rapid. Subsequently, gradients overshoot to smaller magnitudes before leveling off while performance increases rapidly. Then, learning slowly begins to decelerate. We then studied a suite of perturbations to the network state in the early phase, finding that, counter to observations in smaller networks (Zhou et al., 2019), deeper networks are not robust to reinitializing with random weights with maintained signs. We also found that the weight distribution after the early phase of training is highly non-independent. Finally, we measured the data-dependence of the early phase with the surprising result that pre-training on a self-supervised task yields equivalent performance to late rewinding with IMP.

These results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets as well as the ability to identify sub-networks at initialization. Our observation that weights are highly non-independent at the rewinding point suggests that the weights at this point cannot be easily approximated, making approaches which attempt to “jump” directly to the rewinding point unlikely to succeed. However, our result that labels are not necessary to approximate the rewinding point suggests that the learning during this phase does not require task-specific information, suggesting that rewinding may not be necessary if networks are pre-trained appropriately.

REFERENCES

Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. 2019.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. 2017. URL https://arxiv.org/abs/1710.11029.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.


Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2232–2241, 2019. URL http://proceedings.mlr.press/v97/ghorbani19b.html.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1v4N2l0-.

Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. 2019.

Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), volume 5, pp. 6, 2015.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pp. 448–456. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045167.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJlnB3C5Ym.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. 2019.

Levent Sagun, Utku Evci, Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. 2017. URL https://arxiv.org/abs/1706.04454.

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015.

Mingwei Wei and David Schwab. How noise during training affects the Hessian spectrum in over-parameterized neural networks. 2019.

    Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. 2019.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017.

Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. 2019.


    A MODEL DETAILS

Network    Epochs  Batch Size  Learning Rate  Parameters  Eval Accuracy
ResNet-20  160     128         0.1 (Mom 0.9)  272K        91.5 ± 0.2%
ResNet-56  160     128         0.1 (Mom 0.9)  856K        93.0 ± 0.1%
ResNet-18  160     128         0.1 (Mom 0.9)  11.2M       86.8 ± 0.3%
VGG-13     160     64          0.1 (Mom 0.9)  9.42M       93.5 ± 0.1%
WRN-16-8   160     128         0.1 (Mom 0.9)  11.1M       94.8 ± 0.1%

Table A1: Summary of the networks we study in this paper. We present ResNet-20 in the main body of the paper and the remaining networks in Appendix B.

Table A1 summarizes the networks. All networks follow the same training regime: we train with SGD for 160 epochs starting at learning rate 0.1 (momentum 0.9) and drop the learning rate by a factor of ten at epoch 80 and again at epoch 120. Training includes weight decay with weight 1e-4. Data is augmented with normalization, random flips, and random crops of up to four pixels in any direction.
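For reference, this recipe maps onto standard PyTorch/torchvision components roughly as follows. This is a sketch of the stated hyperparameters, not the authors' code, and the CIFAR-10 normalization constants shown are the commonly used values rather than numbers taken from the paper.

```python
import torch
import torchvision.transforms as T

def make_optimizer_and_schedule(model):
    # SGD with momentum 0.9 and weight decay 1e-4, starting at learning rate 0.1.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # Drop the learning rate by 10x at epochs 80 and 120 (of 160 total).
    schedule = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                    milestones=[80, 120], gamma=0.1)
    return optimizer, schedule

# Augmentation: random crops of up to four pixels, random horizontal flips,
# and per-channel normalization.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```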

    B EXPERIMENTS FOR OTHER NETWORKS

Figure A1: Rough timeline of the early phase of training for ResNet-18 on CIFAR-10, including results from previous papers.
Iterations 1-10: gradient magnitudes are very large; rapid motion in weight space; large sign changes (6% in the first 10 iterations).
Iterations 10-500: gradient magnitudes reach a minimum before stabilizing 10% higher at iteration 250; motion in weight space slows, but is still rapid; performance increases rapidly; weight magnitudes decrease linearly; rewinding begins to become effective (~250 iterations).
Iterations 500-2000: gradients converge to roughly constant magnitude; weight magnitudes decrease linearly; performance increase decelerates, but is still rapid; rewinding begins to become highly effective; Hessian eigenspectrum separates (700 iterations; Gur-Ari et al., 2018).
Iterations 2000+: gradient magnitudes remain roughly constant; performance slowly asymptotes; benefit of rewinding saturates; critical periods end (~8000 iterations; Achille et al., 2019; Golatkar et al., 2019).


Figure A2: The effect of IMP rewinding iteration on the accuracy of sub-networks at various levels of sparsity, for VGG-13, ResNet-56, ResNet-18, and WRN-16-8 on CIFAR-10. Accompanies Figure 1.


Figure A3: Basic telemetry about the state of all networks in Table A1 during the first 4000 iterations of training (eval accuracy and loss; average weight magnitude; sign differences from initialization; weight traces; gradient magnitude; L2 distance from the initial and final weights; cosine similarity to the initial and final weights, for VGG-13, ResNet-56, ResNet-18, and WRN-16-8). Accompanies Figure 3.


Figure A4: The effect of training an IMP-derived sub-network initialized to the signs at iteration 0 or k and the magnitudes at iteration 0 or k, for VGG-13, ResNet-56, ResNet-18, and WRN-16-8 on CIFAR-10. Accompanies Figure 4.


[Figure A5 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10 at their respective rewinding iterations, comparing rewinding to 0, rewinding to k, and shuffling the iteration-k weights globally, within layers, or within filters.]

Figure A5: The effect of training an IMP-derived sub-network initialized with the weights at iteration k as shuffled within various structural elements. Accompanies Figure 5.
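The shuffling perturbation can be sketched as follows for a single layer; `weights` and `mask` are hypothetical tensors for one layer's iteration-k weights and its IMP mask (global shuffling would instead permute the surviving weights across all layers at once).

```python
import torch

def shuffle_within_layer(weights, mask):
    """Randomly permute a layer's surviving (unpruned) weights among
    one another, leaving pruned positions at zero."""
    out = torch.zeros_like(weights)
    kept = mask.flatten().nonzero(as_tuple=True)[0]        # indices kept by IMP
    vals = weights.flatten()[kept]
    out.view(-1)[kept] = vals[torch.randperm(vals.numel())]
    return out

def shuffle_within_filters(weights, mask):
    """Same idea, but each output filter of a conv weight [out, in, kh, kw]
    is permuted independently."""
    return torch.stack([shuffle_within_layer(weights[f], mask[f])
                        for f in range(weights.shape[0])])
```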


[Figure A6 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10, comparing rewinding to 0, rewinding to k, and shuffling the iteration-k weights globally, within layers, or within filters while preserving signs (at iteration k or at initialization).]

Figure A6: The effect of training an IMP-derived sub-network initialized with the weights at iteration k as shuffled within various structural elements, where shuffling only occurs between weights with the same sign. Accompanies Figure 6.
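The sign-preserving variant permutes weights only among positions that share a sign, so every surviving location keeps its original sign (the "Signs at Init" curves enforce the sign pattern of a different checkpoint in the same way). A minimal per-layer sketch, again with hypothetical `weights` and `mask` tensors:

```python
import torch

def shuffle_preserving_signs(weights, mask):
    """Permute a layer's unpruned weights, but only among weights of the
    same sign, so each surviving position retains its original sign."""
    out = torch.zeros_like(weights)
    flat_w = weights.flatten()
    flat_m = mask.flatten().bool()
    for sign in (-1.0, 1.0):
        idx = ((torch.sign(flat_w) == sign) & flat_m).nonzero(as_tuple=True)[0]
        vals = flat_w[idx]
        out.view(-1)[idx] = vals[torch.randperm(vals.numel())]
    return out
```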


[Figure A7 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10, comparing rewinding to 0, rewinding to k, and rewinding to k with added Gaussian noise of 0.5σ, 1.0σ, 2.0σ, or 3.0σ.]

Figure A7: The effect of training an IMP-derived sub-network initialized with the weights at iteration k and Gaussian noise of nσ, where σ is the standard deviation of the initialization distribution for each layer. Accompanies Figure 7.
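The Gaussian perturbation can be sketched as below; `weights` and `mask` are hypothetical per-layer tensors, and for illustration σ is taken from a He-normal initializer (σ = sqrt(2 / fan_in)), which is an assumption rather than a statement of the exact initialization scheme used.

```python
import math
import torch

def perturb_with_gaussian(weights, mask, n, fan_in):
    """Add zero-mean Gaussian noise with standard deviation n * sigma, where
    sigma is the layer initializer's standard deviation (assumed He-normal)."""
    sigma = math.sqrt(2.0 / fan_in)
    noise = torch.randn_like(weights) * (n * sigma)
    return (weights + noise) * mask

# n = 0.5, 1.0, 2.0, 3.0 correspond to the legend entries in the panels above.
```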


[Figure A8 panels: Eval Accuracy vs. the standard deviation of the effective noise relative to rewinding to k, for VGG-13 (rewind to 250 and 500, 0.8% sparsity), Resnet-56 (rewind to 500 and 2000, 16.8% sparsity), Resnet-18 (rewind to 500 and 2000, 3.5% sparsity), and WRN-16-8 (rewind to 50 and 250, 6.9% sparsity); points mark the Gaussian-noise, sign/magnitude, and shuffling variants from Figures A4-A7.]

Figure A8: The effective standard deviation of each of the perturbations studied in Section 5 as a function of mean evaluation accuracy (across five seeds). Accompanies Figure 8.
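One plausible way to compute the "effective noise" plotted on the x-axis is the standard deviation of the element-wise difference between a perturbed initialization and the plain rewind-to-k weights over unpruned positions; the sketch below makes that assumption explicit (`perturbed`, `rewind`, and `mask` are hypothetical tensors).

```python
import torch

def effective_noise_std(perturbed, rewind, mask):
    """Std. dev. of (perturbed - rewind) over the weights kept by the IMP mask,
    used here as a stand-in definition of a perturbation's effective noise."""
    diff = (perturbed - rewind)[mask.bool()]
    return diff.std().item()
```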


[Figure A9 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10, comparing rewinding to 0 or k with 2, 5, 10, or 40 epochs of random-label pre-training.]

Figure A9: The effect of pre-training CIFAR-10 with random labels. Accompanies Figure 9.
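Random-label pre-training replaces every CIFAR-10 training label with a uniformly random class before the pre-training phase. A minimal sketch using torchvision (the data path, transforms, and training loop are omitted, and the fixed seed is illustrative):

```python
import numpy as np
import torchvision

# Load CIFAR-10 and overwrite each label with a random class in [0, 10).
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
rng = np.random.RandomState(0)
train_set.targets = rng.randint(0, 10, size=len(train_set.targets)).tolist()
```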

[Figure A10 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10, comparing rewinding to 0 or k with 2, 5, 10, or 40 epochs of self-supervised rotation pre-training.]

Figure A10: The effect of pre-training CIFAR-10 with self-supervised rotation. Accompanies Figure 9.
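Self-supervised rotation pre-training asks the network to predict which of four rotations was applied to each input. A minimal batch-construction sketch (the 4-way classification head and training loop are omitted):

```python
import torch

def rotation_batch(images):
    """Expand a batch [B, C, H, W] into the rotation-prediction task:
    each image is rotated by 0/90/180/270 degrees and labeled 0-3."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return torch.cat(rotated, dim=0), labels
```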


[Figure A11 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10, comparing rewinding to 0 or k with 2, 5, 10, or 40 epochs of pre-training on 4x-blurred inputs.]

Figure A11: The effect of pre-training CIFAR-10 with 4x blurring. Accompanies Figure 9.
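One simple way to realize "4x blurring" is to downsample each image by a factor of four and upsample it back to the original resolution; the sketch below takes that reading (bilinear resampling is an assumption, not necessarily the exact filter used).

```python
import torch.nn.functional as F

def blur_4x(images):
    """Blur a batch [B, C, H, W] by 4x downsampling followed by upsampling
    back to the original spatial size."""
    h, w = images.shape[-2:]
    small = F.interpolate(images, size=(h // 4, w // 4), mode="bilinear",
                          align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bilinear", align_corners=False)
```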

[Figure A12 panels: Eval Accuracy (%) vs. Percent of Weights Remaining for VGG-13, Resnet-56, Resnet-18, and WRN-16-8 on CIFAR-10, comparing rewinding to 0 or k with 2, 5, 10, or 40 epochs of pre-training that combines 4x blurring with self-supervised rotation.]

Figure A12: The effect of pre-training CIFAR-10 with 4x blurring and self-supervised rotation. Accompanies Figure 9.
