The Curious Case of Developmental BERTology

This essay is written for machine learning researchers and neuroscientists (some jargon from both fields will be used). Though it is not intended to be a comprehensive review of the literature, we will take a tour through a selection of classic work and new results from a range of topics, in an attempt to develop the following thesis:

Just like the fruitful interaction between representation learning and perceptual/cognitive neurophysiology, a similar synergy exists between transfer/continual learning, efficient deep learning and developmental neurobiology.

Hopefully it will inspire the reader in one way or another or, at the very least, kill some boredom during a global pandemic.

We are going to touch on the following topics through the lens of large language models:

  • How do overparameterized deep neural nets generalize?
  • How does transfer learning help generalization?
  • How do we make deep learning computationally efficient in practice?
  • In tackling these questions, how might deep learning research both contribute to and benefit from scientific studies of the developing and aging brain?

A philosophical preamble

Before we start, it is prudent to say a few words about the brain metaphor, to clarify this author’s position on an issue that is often central to debates.

The confluence of deep learning and neuroscience arguably took place as early as the conception of artificial neural nets, because artificial neurons abstract characteristic behaviors of biological ones [1]. However, the drastically different learning mechanisms and disparities in the kinds of intelligent functions erected a formidable barrier between the two fields that stood tall for decades. The success of modern deep learning in recent years rekindled another trend of integration, bearing new fruit. Beyond the design of AI systems inspired by the brain (e.g. [2]), deep neural nets have recently been proposed as a useful model system for understanding how the brain works (e.g. [3]). The benefits are mutual. Progress is being made in reconciling the learning mechanisms [4] but, in more than one significant aspect, the intelligence gap obstinately remains [5, 6].

Now, for a deep learning researcher or practitioner looking at this mixed landscape today, is a brain analogy helpful or misleading? It is of course simple to give an answer based on faith, and there are large numbers of believers on both sides. But for now let us not pick a side by belief. Instead, let us evaluate each analogy in its unique context entirely by its practical ramifications: scientifically, it is helpful only if it makes experimentally verifiable/falsifiable predictions, and for engineering, it is useful only if it generates candidate features that can be subject to solid benchmarking. As such, for all brain analogies we are going to raise in the rest of this essay, however appropriate or farfetched they might seem, we shall look past any prior principles and strive to articulate hypotheses that can guide future scientific and engineering work in practice, either within or beyond the limits of these pages.

The working analogy

What do we usually think of a deep neural net when likening it to the brain?

For most, the network architecture maps to the gross anatomy of brain areas (such as in a sensory pathway) and their interconnections, i.e. the connectome, units map to neurons or cell assemblies, and connection weights to synaptic strengths. As such, neurophysiology carries out the computation of model inference.

Learning of deep neural nets typically takes place given a pre-defined network architecture, in the form of optimizing an objective function over a training dataset. (A major difficulty lies in the biological plausibility of artificial learning algorithms, a topic we do not touch in this article; here we simply accept the similarity of function despite the differences in mechanism.) Thus, the data-driven learning by optimization is similar to experience-based neural development, i.e. nurture, whereas network architecture, and to a large degree initialization and some hyperparameters as well, are genetically programmed as a result of evolution, i.e. nature.

Remark: It should be noted that modern deep net architectures, either implicitly engineered by hand or explicitly optimized through neural architecture search (NAS) [7], are also a consequence of data-driven optimization, engendering the inductive bias — the free lunch is paid for by all the unfit that failed to survive natural selection.

Thanks to the rapid growth of data and computing power, the 2010s saw a Cambrian explosion of deep neural net species, spreading rapidly across the world of machine learning.


The plot thickens as the evolution of modern deep learning has produced a cluster of new species in the past two years. They thrive in the continent of natural language understanding (NLU), on fertile deltas of mighty rivers carrying immense computing power, such as the Google and the Microsoft. These remarkable creatures share some key commonalities: they all feature a canonical cortical microcircuitry called the transformer [8], have rapidly increasing brain volumes setting historic records (e.g. [9, 10, 11]) and are often scientifically named after one of the Muppets. But the most prominent common trait of these species, crucial to their evolutionary success, is the capability of transfer learning.

What does this mean? Well, these creatures have a two-stage neural development: a lengthy, self-supervised larval stage called pre-training followed by a fast, supervised maturation stage called fine-tuning. During self-supervised pre-training, huge corpora of unlabeled text are presented to the subject, which plays by itself by optimizing certain objectives very similar to the language quizzes given to human kids, such as completing sentences, filling in missing words, telling the logical order of sentences, and spotting grammatical errors. Then during fine-tuning, a well pre-trained subject can quickly learn to perform a particular language understanding task by supervised training.

Transfer learning’s sweeping conquest of the land of NLU was marked by the advent of bidirectional encoder representations from transformers (BERT) [12]. BERT and its variants have advanced the state-of-the-art by a considerable margin. Their remarkable success piqued tremendous interest in the inner workings of these models, creating the study of “BERTology” (see review [13]). Not unlike neurobiologists, BERTologists stick electrodes into the model brain to record activities for interpretation of the neural code (i.e. activations and attention patterns), make targeted lesions of brain areas (i.e. encoding layers and attention heads) to understand their functions, and study how experiences in early development (i.e. pre-training objectives) contribute to mature behavior (i.e. good performance in NLU tasks).

Network compression

Meanwhile, in the world of deep learning, multi-stage development (like transfer learning) happens in more animal kingdoms than one. Particularly, in production, one often needs to compress a trained huge neural net into a compact one for efficient deployment.

The practice of network compression derives from one of the very puzzling properties of deep neural nets: overparameterization helps not only generalization but optimization as well. That is to say, training a small network is often not only worse than training a large one (if one can afford to do so of course) [14], but also worse than compressing a trained large one to the same small size. In practice, compression can be realized by sparsification (pruning), distillation, etc.

Remark: It is worth noting that the phenomenon of the best sparse networks arising from optimizing and then compressing a dense one (see e.g. [15, 16]) is very much like the developing brain, in which over-produced connections are gradually pruned [17].

The type of multi-stage development in model compression, however, is very different from transfer learning. The two stages of transfer learning see the same model being optimized for different objectives, whereas in model compression, the original model morphs into a different one in order to retain optimality for the same objective. If the former resembles maturation to acquire new skills, then the latter is more like graceful aging without losing already learned skills.

Learning weights vs. learning structures: a duality?

When a network is compressed, its structure often undergoes changes. This could mean a change in either the network architecture (e.g. in the case of distillation) or parameter sparseness (e.g. in the case of pruning). These structural changes are usually imposed by heuristics or regularizers that constrain the otherwise already effective optimization.

But can structure rise above being merely an efficiency constraint and become an effective means for learning? An increasing number of emerging studies seem to suggest so.

One intriguing case is weight-agnostic networks [18]. These jellyfish-like creatures do not have to learn during their lifespan, but still are extremely well adapted to their ecological niches, because evolution did all the heavy lifting in choosing an effective brain structure for them.

Even with a fixed architecture chosen by nature, learning sparse structure can still be as effective as learning synaptic weights. Recently, Ramanujan et al. [19] managed to find sparsified versions of initialized convolutional nets which, if made wide and deep enough, generalize no worse than dense ones undergoing weight training. Theoretical investigations also suggest that sparsification of random weights can be just as effective as optimizing parameters if the model is sufficiently overparameterized [20, 21].

Thus, in the grossly overparameterized regime of modern deep learning, we have in our sheath a double-edged sword: optimization of weights and of structure. This is reminiscent of both synaptic and structural plasticity as mechanisms underlying biological learning and memory (e.g. see [22, 23]).

Remark: A formal way of describing parameter sparseness is through the formulation of a parameter mask (Figure 1). Learning can be realized either by optimization of continuous weights within a fixed structure, or by optimization of discrete structure given a fixed set of weights (Figure 2).

Figure 1. The parameter-mask formulation of structural sparseness of model parameters.
Figure 2. Learning weights versus learning structure.
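
The parameter-mask formulation can be sketched in a few lines of plain Python. This is only an illustration, not code from any of the cited works; the helper names `apply_mask` and `linear` and all numbers are assumptions made for the example. The effective parameters are the element-wise product of a weight matrix and a binary mask, so learning may change either factor while the other stays fixed.

```python
def apply_mask(weights, mask):
    """Effective parameters: element-wise product of weights and a 0/1 mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def linear(x, weights):
    """Output of a bias-free linear layer, y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy 2x3 layer with hypothetical values.
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.25, -0.5]]
mask = [[1, 0, 1],
        [0, 1, 1]]
x = [1.0, 2.0, 3.0]

dense_out = linear(x, weights)                     # all connections present
sparse_out = linear(x, apply_mask(weights, mask))  # masked connections removed

# Learning weights:   update `weights`, keep `mask` fixed.
# Learning structure: update `mask`, keep `weights` fixed.
```

Under this view, weight learning and structural learning are two ways of moving the same product `weights * mask` toward a good solution.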

Fine-tuning by sparsification

Now that structure, just like weights, can be optimized for learning, can this mechanism be used to make transfer learning better?

Yes, it can indeed. Recently, Radiya-Dixit & Wang [24] made BERT pick up this new gene and evolve into something new. They showed that BERT can be effectively fine-tuned by sparsification of pre-trained weights without changing their values, as demonstrated systematically with the General Language Understanding Evaluation (GLUE) tasks [25].

Figure 3. Fine-tuning BERT by sparsification [24].
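
To make the idea concrete, here is a toy sketch of learning a binary mask over frozen weights. It is not the procedure of [24]; it assumes one common recipe (a real-valued score per weight, thresholded at zero, with the mask gradient passed straight through to the score), a hypothetical four-weight linear "model", and one-hot probe inputs chosen so that the result is easy to follow by hand.

```python
def predict(weights, mask, x):
    """Linear model whose effective parameters are weights * mask."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


def binarize(scores):
    """A connection is kept when its score is positive."""
    return [1 if s > 0 else 0 for s in scores]


def finetune_mask(weights, data, steps=100, lr=0.1, init_score=0.5):
    """Learn a mask by gradient descent on scores; weights are never updated."""
    scores = [init_score] * len(weights)      # start from the dense network
    for _ in range(steps):
        mask = binarize(scores)
        grads = [0.0] * len(weights)
        for x, y in data:
            err = predict(weights, mask, x) - y
            for i, (w, xi) in enumerate(zip(weights, x)):
                # Gradient of the mean squared error w.r.t. mask_i,
                # passed straight through to score_i.
                grads[i] += 2.0 * err * w * xi / len(data)
        scores = [s - lr * g for s, g in zip(scores, grads)]
    return binarize(scores)


pretrained = [1.0, -2.0, 0.5, 3.0]            # hypothetical frozen weights
# Toy downstream task that only needs connections 0 and 3.
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
targets = [1.0, 0.0, 0.0, 3.0]
task_mask = finetune_mask(pretrained, list(zip(inputs, targets)))
```

In this toy setting the only task-specific object that needs to be kept is `task_mask`; the list `pretrained` is read but never modified.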

Remark: Note that similar fine-tuning by sparsification has been successfully applied to computer vision, e.g. [26]. Also take note of existing work sparsifying BERT during pre-training [27].

Fine-tuning by sparsification has favorable practical implications. On the one hand, pre-trained parameter values remain the same in learning multiple tasks, reducing task-specific parameter storage to only a binary mask; on the other hand, sparsification compresses the model, potentially obviating many “multiply-by-zero-and-accumulate” operations with proper hardware acceleration. One stone kills two birds.

Beyond the practical benefits, however, the possibility of fine-tuning by sparsification brings about a few new opportunities for a deeper understanding of language pre-training and its potential connections to the biological brain. Let us take a look at them in the next sections.

Winning tickets of a different lottery

First we study the nature of language pre-training from the perspective of optimization.

It seems that language pre-training meta-learns a good initialization for learning downstream NLU tasks. As Hao et al. [28] recently showed, pre-trained BERT weights have good task-specific optima that are closer and flatter in the loss landscape. This means pre-training makes fine-tuning easier, and the fine-tuned solutions generalize better.

Similarly, pre-training also makes discovery of fine-tuned sparse subnetworks easier [24]. As such, interestingly, pre-trained language models have all the key properties of a “winning lottery ticket” as formulated by Frankle and Carbin [29], but of exactly the complementary kind given the duality of optimizing weights vs. structure (Figures 3, 4):

  • The Frankle-Carbin winning ticket is a specific sparse structure that facilitates weight optimization. It is sensitive to weight initialization [29]. It is potentially transferable across vision tasks [30].
  • A pre-trained language model is a specific set of weights that facilitates structural optimization. It is sensitive to structural initialization [24]. It is transferable across NLU tasks [24].

Figure 4. The Frankle-Carbin winning ticket [29], cf. fine-tuning by sparsification (Figure 3).

Remark: Note that the “winning ticket” property of pre-trained BERT is different from the wide-and-deep regime as in [19]. It remains an open question whether large transformer-based language models, if made sufficiently wide and deep (bound to be astronomically large given their already huge sizes), might be effectively fine-tuned from random initializations without pre-training.

Though learning weights of a winning lottery ticket and searching for a subnetwork within pre-trained weights lead to the same outcome (a compact, sparse network that generalizes well), the biological plausibility of the two approaches is drastically different: finding a Frankle-Carbin ticket involves repeated rewinding in time and re-training, a process only possible across multiple biological generations if earlier states could be genetically encoded and then reproduced in the next generation so as to realize rewinding. But weight pre-training followed by structural sparsification is similar to development and aging, all within a single generation. Thus, dense pre-training and sparse fine-tuning might be a useful model for neural development.
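
The "rewinding in time" can be illustrated with a deliberately simplified sketch of iterative magnitude pruning, written in the spirit of [29] but not taken from it. The decoupled toy loss, the helper names `train` and `find_ticket`, and all numbers are assumptions for illustration. Each round trains from the same initialization, removes the surviving weight with the smallest trained magnitude, and then starts over.

```python
def train(weights, mask, targets, steps=50, lr=0.25):
    """Gradient descent on a toy squared error; pruned weights are not updated."""
    w = list(weights)
    for _ in range(steps):
        # Toy loss: sum_i mask_i * (w_i - target_i)^2, factor 2 folded into lr.
        w = [wi - lr * m * (wi - t) for wi, m, t in zip(w, mask, targets)]
    return w


def find_ticket(init, targets, rounds):
    """Prune one smallest-magnitude surviving weight per round, then rewind."""
    mask = [1] * len(init)
    for _ in range(rounds):
        trained = train(init, mask, targets)   # rewind: always restart from init
        alive = [i for i, m in enumerate(mask) if m]
        victim = min(alive, key=lambda i: abs(trained[i]))
        mask[victim] = 0                       # structural change
    return mask


init = [0.2, -0.1, 0.3, -0.4]        # hypothetical random initialization
targets = [1.0, 0.0, 0.05, 3.0]      # hypothetical task optimum
ticket_mask = find_ticket(init, targets, rounds=2)
ticket_weights = [w * m for w, m in zip(init, ticket_mask)]  # rewound, untrained
```

Note that every round discards the trained weights and returns to `init`, which is the step that has no obvious counterpart within a single biological lifetime.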

Robustness: same function from different structures

Another uncanny similarity between BERT and the brain is structural robustness.

There seems to be an abundance of good subnetworks of pre-trained BERT at a wide range of sparsity levels [24]: a typical GLUE task can be learned by eliminating from just a few percent to over half of pre-trained weights, with good sparse solutions existing everywhere in between (Figure 5, left). This is reminiscent of structural plasticity at play in the maturing and aging brain: its acquired function remains the same while the underlying structure undergoes continuous changes over time. This is very different from the brittle point solutions produced by traditional engineering.

Figure 5. Structural robustness of fine-tuned language models by sparsification. (Left) There exist many good subnetworks of pre-trained BERT that span a wide range of sparsity (from a few percent to more than half) [24]. (Right) A cartoon view of the loss landscape during continual sparsification. Dense training (solid magenta and orange arrows) finds low-loss solutions lying on a continuous manifold (dotted yellow box similar to Figure 1 of [31]). As long as any structural perturbation by weight elimination (purple dotted arrows and circles) does not deviate far from the low-loss manifold, a quick structural fine-tuning (magenta dotted arrows and circles) can restore optimality, continually. The blue grid represents the discrete set of sparse parameters.

This phenomenon stems primarily from overparameterization of deep neural nets. In the modern regime of gross overparameterization, optima in the loss landscape are typically high-dimensional continuous non-convex manifolds [31, 32]. This is strangely similar to biology, where identical network behavior can arise from vastly different underlying parameter configurations, forming a non-convex set in the parameter space, e.g. see [33].

Now comes the interesting part. Just like the life-long homeostatic adjustment in biology, a similar mechanism might support continual learning in overparameterized deep nets (illustrated in Figure 5, right): early-stage learning of dense connections finds a good solution manifold, along which an abundance of good sparse solutions exist; as the network ages, continual and gradual sparsification of the network can be quickly fine-tuned by structural plasticity (like the brain that maintains life-long plasticity).

From the neurobiological perspective, if one accepts the optimizational hypothesis [3], then life-long plasticity must carry out some functional optimization continually throughout the lifespan. Following this logic, neurodevelopmental disorders that arise from this process going awry should essentially be optimizational diseases, with etiological characterizations such as bad initialization, unstable optimizer dynamics, etc.

Whether the aforementioned hypothesis holds true for deep neural nets in general, and whether it is adequate for them to serve as a good model for neural development and pathophysiology, are open questions for future research.

How much did BERT learn?

Finally, let us apply some neuroscientific thinking to BERTology.

We ask the question: how much information is stored in pre-trained BERT parameters relevant for solving an NLU task? It is not an easy question to answer because sequential changes in parameter values during pre-training and during fine-tuning confound each other.

This limitation is no longer there in the case of BERT fine-tuned by sparsification, where pre-training only learns weight values and fine-tuning only learns structure. To a biologist, it is always good news if two stages of development involve completely different physiological processes, in which case one of them can be used to study the other.

Now let us do exactly this. Let us perturb the pre-trained weight values and study the downstream consequences. For this experiment, we do not make physiological perturbations (such as lesioning attention heads), but a pharmacological one instead: systemic application of a substance that affects every single synapse in the entire brain. This drug is quantization. Table 1 summarizes some preliminary dose-responses: though BERT and related species have developed large brains, it seems knowledge learned during language pre-training might be described by just a few bits per synapse.

In practice, this means that, since pre-trained weights do not change values during fine-tuning by sparsification, one might only need to store a low-precision integer version of all BERT parameters without any adverse consequences — a significant compression. The upshot: all you need is a quantized integer version of pre-trained parameters shared across all tasks, with a binary mask fine-tuned for each task.
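
As a rough sketch of what this could look like, the code below applies uniform symmetric quantization to a weight list and then compares storage costs. It is not the quantization scheme behind Table 1 (which is not specified here); the helper names, the 4-bit setting, the parameter count and the number of tasks are all hypothetical choices for illustration.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization to signed integer codes."""
    levels = 2 ** (bits - 1) - 1                  # e.g. 7 for 4 bits
    scale = (max(abs(w) for w in weights) / levels) or 1.0  # guard all-zero input
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


def storage_bytes(n_params, n_tasks, shared_bits, per_task_bits):
    """Bytes for one shared copy plus per-task data, ignoring metadata."""
    return (n_params * shared_bits + n_tasks * n_params * per_task_bits) // 8


pretrained = [0.6, -1.0, 0.3, 0.0]                # hypothetical weights
codes, scale = quantize(pretrained, bits=4)
restored = dequantize(codes, scale)

# Hypothetical setting: 1M parameters shared across 8 tasks.
n_params, n_tasks = 1_000_000, 8
# One full 32-bit copy of the weights per task.
dense = storage_bytes(n_params, n_tasks, shared_bits=0, per_task_bits=32)
# One shared 4-bit copy plus a 1-bit mask per task.
compact = storage_bytes(n_params, n_tasks, shared_bits=4, per_task_bits=1)
```

In this hypothetical setting, `dense` comes to 32,000,000 bytes and `compact` to 1,500,000 bytes. Whether a given bit width preserves task performance is an empirical question that this sketch does not address.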

Remark: Note that existing work on quantization of BERT weights quantizes fine-tuned weights (e.g. Q-BERT [34]) instead of pre-trained weights.

Table 1. F1 scores of fine-tuned BERT and related models for MRPC. Thanks to Hugging Face’s Transformers library, experiments like this are a breeze.


Deep neural nets and the brain have obvious differences: at the lowest level, in learning algorithms, and at the highest level, in general intelligence. Nevertheless, profound similarities at intermediate levels have proven beneficial for the advancement of both deep learning and neuroscience.

For instance, perceptual and cognitive neurophysiology has already inspired effective deep network architectures which in turn make a useful model for understanding the brain. In this essay, we proposed another point of intersection: biological neural development might inspire efficient and robust optimization procedures which in turn serve as a useful model for maturation and aging of the brain.

Remark: It should be noted that neural development in the context of traditional connectionism was proposed in the 1990s (e.g. see [35]).

Specifically, we have reviewed some recent results on weight learning and structural learning as complementary means to optimization, and how they, in combination, realize efficient transfer learning in large language models.

As structural learning becomes increasingly important in deep learning, we shall see corresponding hardware accelerators emerge (e.g. NVIDIA’s Ampere architecture supporting sparse weights [36]). This is likely to bring about a new wave of architectural diversification of specialized hardware: acceleration of structural learning requires smart data movement adapted to specific computations, a new frontier for exploration.


[1] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity”, 1943.
[2] D. Hassabis, et al., “Neuroscience-Inspired Artificial Intelligence”, 2017.
[3] B. A. Richards, et al., “A deep learning framework for neuroscience”, 2019.
[4] T. P. Lillicrap, et al., “Backpropagation and the brain”, 2020.
[5] G. Marcus, “Deep Learning: A Critical Appraisal”, 2018.
[6] G. Marcus, “The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence”, 2020.
[7] M. Wistuba, et al., “A Survey on Neural Architecture Search”, 2019.
[8] A. Vaswani, et al., “Attention Is All You Need”, 2017.
[9] M. Shoeybi, et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, 2019.
[10] Microsoft Research, “Turing-NLG: A 17-billion-parameter language model by Microsoft”, 2020.
[11] T. B. Brown, et al., “Language Models are Few-Shot Learners”, 2020.
[12] J. Devlin, et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018.
[13] A. Rogers, et al., “A Primer in BERTology: What we know about how BERT works”, 2020.
[14] M. Belkin, et al., “Reconciling modern machine-learning practice and the classical bias–variance trade-off”, 2019.
[15] M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression”, 2017.
[16] T. Gale, et al., “The State of Sparsity in Deep Neural Networks”, 2019.
[17] S. Navlakha, et al., “Network Design and the Brain”, 2018.
[18] A. Gaier and D. Ha, “Weight Agnostic Neural Networks”, 2019.
[19] V. Ramanujan, et al., “What’s Hidden in a Randomly Weighted Neural Network?”, 2019.
[20] E. Malach, et al., “Proving the Lottery Ticket Hypothesis: Pruning is All You Need”, 2020.
[21] M. Ye, et al., “Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection”, 2020.
[22] F. H. Gage, “Structural plasticity of the adult brain”, 2004.
[23] H. Johansen-Berg, “Structural Plasticity: Rewiring the Brain”, 2007.
[24] E. Radiya-Dixit and X. Wang, “How fine can fine-tuning be? Learning efficient language models”, 2020.
[25] A. Wang, et al., “GLUE: A multi-task benchmark and analysis platform for natural language understanding”, 2019.
[26] A. Mallya, et al., “Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights”, 2018.
[27] M. A. Gordon, et al., “Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning”, 2020.
[28] Y. Hao, et al., “Visualizing and understanding the effectiveness of BERT”, 2020.
[29] J. Frankle and M. Carbin, “The Lottery Ticket Hypothesis: Finding Small, Trainable Neural Networks”, 2018.
[30] A. S. Morcos, et al., “One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers”, 2019.
[31] F. Draxler, et al., “Essentially No Barriers in Neural Network Energy Landscape”, 2018.
[32] S. Fort and S. Jastrzebski, “Large Scale Structure of Neural Network Loss Landscapes”, 2019.
[33] E. Marder, “Variability, compensation, and modulation in neurons and circuits”, 2011.
[34] S. Shen, et al., “Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT”, 2019.
[35] J. Elman, et al., “Rethinking Innateness: A Connectionist Perspective on Development”, 1996 (ISBN 978-0-262-55030-7).
[36] NVIDIA Blog, “What Is Sparsity in AI Inference?”, 2020.

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.
