Paper readthrough: Rethinking the Value of Network Pruning
Attention Conservation Notice: Notes on a paper that asks (of a particular kind of model pruning) “should we start where we plan to end up, and just train the pruned architecture from scratch?” The answer turns out to be that – once you adjust for the total number of FLOPs – just starting with the smaller model and random weights generally works fine. Plus other interesting observations about the structure-vs-initialization question.
Rethinking the Value of Network Pruning – Liu et al., 2019
This paper splits model pruning strategies into two categories:
- “Predefined” pruning where you decide in advance of the training process how you’re going to shrink the model (such as by removing a fixed proportion of weights or nodes from each layer, perhaps in multiple passes), and
- “Automatic” pruning, where the decision of how many weights/nodes can be safely removed from the network is learned as part of the training process. (A small sketch contrasting the two follows this list.)
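To make that distinction concrete, here’s a minimal PyTorch sketch – nothing here is from the paper’s code; the layer sizes are made up, and the BatchNorm-scale thresholding is just one stand-in for what an “automatic” method might learn:

```python
import torch
import torch.nn as nn

# Hypothetical two-conv-layer stack, purely for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
)

# "Predefined": the per-layer pruning ratio is decided before training starts.
predefined_keep_ratio = 0.5
predefined_channels = {
    name: int(m.out_channels * predefined_keep_ratio)
    for name, m in model.named_modules() if isinstance(m, nn.Conv2d)
}

# "Automatic": how much each layer shrinks is an outcome of training. One proxy
# (an assumption here, not the paper's code) is to threshold learned BatchNorm
# scale factors with a single global cutoff, so different layers end up pruned
# by different amounts.
for m in model.modules():               # stand-in for scales learned under a
    if isinstance(m, nn.BatchNorm2d):   # sparsity penalty during training
        nn.init.uniform_(m.weight)

bn_scales = torch.cat([m.weight.detach().abs()
                       for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
threshold = bn_scales.quantile(0.5)     # keep the top half of channels overall
automatic_channels = {
    name: int((m.weight.detach().abs() > threshold).sum())
    for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)
}

print(predefined_channels)   # fixed ratios, e.g. {'0': 32, '3': 64}
print(automatic_channels)    # per-layer counts fall out of the global threshold
```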
And it splits the kinds of alterations made to the model into two groups:
- “Structured” alterations change large-scale structure: removing channels from convolutions, removing nodes from fully connected layers. These kinds of modifications can dramatically reduce the size of the model both on-disk and in-memory, since when you (e.g.) remove a node, you remove all of its input and output weights, and you don’t have to store its intermediate value during the forward pass.
- “Unstructured” alterations, by contrast, are essentially individual weight prunings; these make the weight matrices sparser, and so save on-disk storage, but generally don’t have a significant impact on the in-memory size of the model or on computation time (alas, sparse operations on GPU tensors really aren’t a thing yet). A quick sketch of the difference follows this list.
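Here’s a rough PyTorch illustration of structured versus unstructured pruning – made-up layer sizes and a simple magnitude criterion, not anything from the paper:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Structured: drop half the output channels. The weight tensor literally gets
# smaller, so in-memory size and FLOPs both drop (and the next layer's input
# channels shrink along with it).
keep = torch.topk(conv.weight.detach().abs().sum(dim=(1, 2, 3)), k=64).indices
smaller = nn.Conv2d(64, 64, kernel_size=3, padding=1)
smaller.weight.data = conv.weight.data[keep].clone()
smaller.bias.data = conv.bias.data[keep].clone()

# Unstructured: zero out the smallest-magnitude half of the weights. The tensor
# keeps its shape, so it compresses well on disk but still runs as a dense op.
flat = conv.weight.detach().abs().flatten()
threshold = flat.kthvalue(flat.numel() // 2).values
mask = (conv.weight.detach().abs() > threshold).float()
conv.weight.data *= mask

print(smaller.weight.shape)   # torch.Size([64, 64, 3, 3]) -- genuinely smaller
print(conv.weight.shape)      # torch.Size([128, 64, 3, 3]) -- same shape, just sparse
```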
In practice, the “unstructured” pruning is basically always “automatic” – though looking at predefined sparsity patterns for weights is certainly an interesting idea to follow up on! – so they consider three of the four possible cases: predefined structured pruning, automatic structured pruning, and automatic unstructured pruning.
So if you’re doing predefined structured model pruning, you know in advance exactly what final model structure you’ll end up with. The strategy for actually doing the pruning generally involves alternating between training and pruning steps, at each step “fine-tuning” the model by continuing the training from where it left off, using the remaining (non-pruned) weights. Intuitively, this makes sense: we know that deep neural networks are almost always severely overparameterized, but having a large number of weights lets us find a (probably overfit) solution quickly. We can then trim the model down to the critical weights, hopefully slightly regularizing it in the process, without throwing away the ‘important’ weights that presumably give us a good solution. We just do a little bit of fine-tuning to let it adjust to the fact that it has fewer weights now, and we’re good.
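In code, that standard pipeline looks roughly like the sketch below – the training function is a placeholder and the layer sizes and pruning fraction are invented, so treat it as a cartoon of the procedure rather than any particular paper’s implementation:

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, fraction: float) -> None:
    """Zero out the lowest-magnitude `fraction` of weights in each Linear layer."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            flat = m.weight.data.abs().flatten()
            k = max(1, int(fraction * flat.numel()))
            threshold = flat.kthvalue(k).values
            m.weight.data[m.weight.data.abs() <= threshold] = 0.0

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for prune_round in range(3):
    # train_for_a_while(model, optimizer)   # placeholder: ordinary supervised training
    magnitude_prune(model, fraction=0.3)
    # Fine-tuning then continues from the surviving (non-pruned) weights rather
    # than re-initializing -- which is exactly the assumption the paper pokes at.
    # (A real implementation would also keep a mask around so that pruned
    # weights stay zero through subsequent gradient updates.)
```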
This paper doesn’t take for granted that this fine-tuning pipeline is a reasonable thing to do, and instead asks “With predefined pruning, we know where we’re going to end up – so could we just start with the smaller network?” And the answer, surprisingly, turns out to be “in many cases, yes!”
On the automatic unstructured pruning side, they take aim at the Lottery Ticket Hypothesis (which, if you need a refresher, advanced the hypothesis that smaller networks worked, but required a good and hard-to-define initialization to train effectively, and that larger networks basically gave you a better chance of producing one of those small subnetworks with a good initialization by luck). In a similar vein to the Zhou et al. paper from Uber, they find that the structure of an automatically pruned network is important, but – if you’re fair about training budget w/r/t total number of FLOPs – the initial weights actually aren’t that important. Note that this isn’t incompatible with the LTH results, which show that the pruned network trains faster with the original initialization. Liu et al. simply point out that of the two (initialization weights vs model structure), it’s the structure that is necessary for good performance, which they show by comparison with random weight deletion. The original weights might speed things up (though you can reduce this effect by searching for a better learning rate), but the model will still learn with a random initialization (note that the Uber paper found something similar; they found that keeping the signs of the initialization weights was sufficient to accelerate training significantly).
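To make the structure-vs-initialization comparison concrete, here’s a sketch of the two conditions being set up – a toy layer, a random stand-in for the learned mask, and no actual training loop:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(100, 10)
original_init = copy.deepcopy(net.state_dict())

# ...imagine training `net` here and deriving a sparsity mask from the result...
mask = (torch.rand_like(net.weight) > 0.8).float()   # stand-in for a learned mask

# Condition A (lottery-ticket style): same mask, rewound to the original init.
net_a = nn.Linear(100, 10)
net_a.load_state_dict(original_init)
net_a.weight.data *= mask

# Condition B (the comparison Liu et al. run): same mask, fresh random init.
net_b = nn.Linear(100, 10)   # brand-new random weights
net_b.weight.data *= mask

# Train both under the same FLOP budget; the claim is that B ends up roughly as
# good as A, especially once its learning rate is tuned separately.
```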
Finally (though they address this second) they look at automatic structured pruning, which sits in kind of a middle ground. Here they find something that similarly occupies a middle ground: while you don’t know what architecture you’ll end up with until you’ve trained, once you have, you can take the “learned” model structure, retrain it from scratch with the same computational budget, and end up with an as-good or better result. Here, obviously, the use case is a bit less clear since you’ve already spent the GPU time to get the jointly pruned-and-optimized model, so it doesn’t make as much sense to retrain it all over again (versus predefined structured pruning where you should, according to these results, just train the model that you’ll end up with after pruning and budget some extra time for it).
And that’s one key point that they make throughout: most of the time we think about “training time” as “number of epochs” – basically, how many times we run the complete data set through the model for the model to learn from. They point out that if you instead think of ‘training budget’ as ‘the number of floating-point operations that get carried out’, then the epoch-for-epoch comparison actually unfairly penalizes smaller models. If the pruned model costs half the FLOPs per pass, but you train it for the same number of epochs, then you’re really only giving it half the training budget. For most of the predefined pruning approaches, this turns out to make a difference.
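The accounting is simple enough to spell out; the numbers below are invented, but the scaling is the whole point:

```python
# Made-up numbers, just to show the budget arithmetic.
full_model_flops   = 4.1e9   # hypothetical FLOPs per forward pass, full model
pruned_model_flops = 2.0e9   # hypothetical FLOPs per forward pass, pruned model
baseline_epochs = 160        # however long the big model trains for

# Training both for `baseline_epochs` quietly gives the pruned model ~half the
# compute; scaling its epoch count by the FLOP ratio evens the budgets out.
budget_matched_epochs = int(baseline_epochs * full_model_flops / pruned_model_flops)
print(budget_matched_epochs)   # 328
```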
The final section of the paper dives directly into the “model pruning as architecture search” question by asking whether the model architectures obtained by automatic methods (either unstructured or structured) do better than ones where the number of deleted nodes/weights is the same but they’re deleted at random. Unsurprisingly, the learned architectures usually do better than the randomly pruned ones, but again, here’s a case where you have to go through the training/pruning process to find the final architecture, so retraining it from scratch doesn’t make as much sense (even as it is interesting that retraining from scratch can still beat fine-tuning). However, in a few cases – PreResNet and DenseNet – this doesn’t hold: random pruning does about the same as guided pruning. They don’t really have a great explanation for this (nor do I, to be fair) other than noting that the automatically pruned models do look like weights were deleted more or less at random.
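One way to set up that random-pruning control looks something like this – per-layer channel counts matched to the learned architecture, with the specific channels chosen at random (the paper’s exact protocol may differ; the layer names and counts here are invented):

```python
import torch
import torch.nn as nn

# Per-layer channel counts that some automatic method supposedly settled on --
# these numbers and layer names are made up for illustration.
learned_keep_counts = {"conv1": 40, "conv2": 97, "conv3": 12}
layers = {
    "conv1": nn.Conv2d(3, 64, 3),
    "conv2": nn.Conv2d(64, 128, 3),
    "conv3": nn.Conv2d(128, 128, 3),
}

# The control: keep the same per-layer counts, but pick *which* channels to
# keep uniformly at random instead of by any learned criterion.
random_keep = {
    name: torch.randperm(layer.out_channels)[: learned_keep_counts[name]]
    for name, layer in layers.items()
}

# Build and train this randomly chosen architecture from scratch, then compare
# it to the learned one. Usually the learned structure wins; for PreResNet and
# DenseNet, the paper reports the gap mostly disappears.
```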
In the appendix, they take on the LTH paper, and question whether or not the “good initialization” actually matters all that much. They find that – if you really crank up the learning rate – a random initialization can match (and sometimes beat) the learning curves you get from the LTH initialization. This reminded me a bit of the “superconvergence” work by Smith and Topin, in which they (rough and necessarily incomplete summary) advocate spending one epoch searching across learning rates to find the maximum rate you can use without the optimization process going completely off the rails, then using that rate throughout most of training.
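For reference, that learning-rate search looks roughly like the sketch below – `model`, `loader`, and `loss_fn` are placeholders and the divergence threshold is arbitrary, so this is an approximation of the idea rather than Smith and Topin’s actual procedure:

```python
import math
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-5, lr_max=10.0):
    """Ramp the learning rate up over one epoch and record (lr, loss) pairs."""
    opt = torch.optim.SGD(model.parameters(), lr=lr_min)
    n = len(loader)
    history = []
    for i, (x, y) in enumerate(loader):
        # Exponentially interpolate from lr_min to lr_max across the epoch.
        lr = lr_min * (lr_max / lr_min) ** (i / max(1, n - 1))
        for group in opt.param_groups:
            group["lr"] = lr
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        history.append((lr, loss.item()))
        if not math.isfinite(loss.item()) or loss.item() > 4 * history[0][1]:
            break   # loss has gone off the rails; the usable max LR is below this
    return history
```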
So what did we learn from all this? I think the key practical takeaway is that for the specific case of predefined, structured pruning, you’re better off just doing a hyperparameter search; pruning doesn’t seem to get you all that much. For automatic pruning, you should probably fine-tune a bit longer than you planned to. On the theoretical side, this makes a convincing argument that the structure of the network might be more important than the initialization, and raises some interesting questions about what you should do with learning rates as you shrink the size of the model.