Less can be More: Sparsity as a Paradigm for LLM Development

Siddharth Sharma

July 25, 2023

This article was written by Lux Capital summer associate Siddharth Sharma.

“Less is more only when more is too much.” ― Frank Lloyd Wright

Inspiration from PageRank

In 1996, Larry Page and Sergey Brin began work on what would become Google's secret weapon: the PageRank algorithm. This pioneering ranking method harnessed the power of sparsity, the property of being scant or scattered. It distilled the vast, chaotic web of internet pages into an ordered hierarchy, much like an organized library that saves visitors from an overwhelming deluge of information. PageRank's insight lay in its selective use of sparse connections, paying attention to the most critical links between individual webpages and mostly ignoring the rest.

This discerning focus on critical connections amidst a sea of information echoes once again in the world of Large Language Models (LLMs). Sparsity advocates for a more selective model training approach that can alleviate computational demands and facilitate easier model deployment, offering an alternative to today’s massive models. By embracing sparsity, we're not only refining our models; we're affirming a fundamental principle resonating throughout nature and engineering — complexity can be reduced without compromising efficacy.

For the last decade, increased complexity at scale has been our best friend in deep learning, but it’s time to change the status quo.

What is Sparsity?

The development of primitives in deep learning has long been heavily influenced by our understanding of the human brain, and specifically the principle of sparsity in neural connections. Our brains are teeming with neurons, but not every neuron is connected to every other neuron. Instead, a system of sparse connections activates selectively during specific tasks or thought processes. This efficient design reduces energy consumption, prevents overstimulation, and allows the brain to handle immense amounts of information with limited resources.

Overparameterized large language models like GPT-3 have proven to generalize well. Overparameterization can enhance performance, improve the optimization landscape of a problem (with minimal effect on statistical generalization performance), and provide implicit regularization, helping models capture complex patterns in the data more effectively. However, their training costs, both financial and environmental, have grown unsustainably high.

Sparsity, a concept with a long history in machine learning, statistics, neuroscience, and signal processing, offers a solution. Paralleling these principles in AI has led to the development of sparse Large Language Models. The aim of sparsity is to maintain the benefits of overparameterization while reducing the demand for computation and memory. Reducing these demands not only cuts costs but also speeds up model training, eases deployment, democratizes machine learning access, and lessens environmental impact.

The Zoo of Sparsity Methods

Just as the human brain optimizes its functionality by switching off less crucial connections, sparsity techniques aim to distill and prune AI models, retaining only the most impactful elements. Researchers have developed a number of approaches to bring sparsity into AI. For instance, the Pixelated Butterfly method from the HazyResearch lab at Stanford operates at the level of matrix operations. The authors specifically combine specialized butterfly and low-rank matrices to yield a simple and efficient sparse training method. It applies to most major network layers that rely on matrix multiplication.
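To make the "sparse plus low-rank" idea concrete, here is a minimal sketch in plain Python. It is not the Pixelated Butterfly implementation: a block-diagonal matrix is used as a simple stand-in for the butterfly sparsity pattern, and the function names and shapes are hypothetical. It only illustrates that a layer's output can be computed from a sparse part and two thin matrices, without ever building the dense weight matrix.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def sparse_plus_low_rank(blocks, U, Vt, x):
    """Compute (B + U @ Vt) @ x without materializing the dense matrix.

    blocks : list of square blocks forming a block-diagonal matrix B
             (an assumed stand-in for a butterfly sparsity pattern)
    U      : n x r matrix
    Vt     : r x n matrix, with rank r much smaller than n
    """
    y, offset = [], 0
    for block in blocks:
        size = len(block)
        # Each block only touches its own slice of the input.
        y.extend(matvec(block, x[offset:offset + size]))
        offset += size
    # Two thin products replace one n x n product.
    low_rank = matvec(U, matvec(Vt, x))
    return [a + b for a, b in zip(y, low_rank)]
```

For an n x n layer with block size b and rank r, this stores roughly n*b + 2*n*r numbers instead of n*n, which is where the savings would come from when b and r are small.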

Another example is pruning, one of the primary techniques for implementing sparsity. Pruning removes parameters from an existing network, either individually or in groups. By removing elements of the model such as weights, neurons, or even whole layers, pruning reduces the complexity, computational demand, and memory footprint of a model.
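One common way to decide which individual weights to remove is by magnitude. The sketch below assumes a hypothetical weight matrix stored as nested lists and zeroes out the smallest-magnitude fraction of its entries; real pruning pipelines typically also fine-tune the network afterwards, which is not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights  : list of lists of floats (a hypothetical weight matrix)
    sparsity : fraction of entries to remove, between 0.0 and 1.0
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]  # largest magnitude that still gets pruned
    # Ties at the threshold are pruned together, so the result may be
    # slightly sparser than requested.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


layer = [[0.9, -0.05, 0.3], [0.01, -0.7, 0.2]]
pruned = magnitude_prune(layer, sparsity=0.5)
```

Here half of the entries are set to zero while the largest weights in the layer are left untouched.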

Furthermore, distillation serves as a knowledge-transfer process in which a larger "teacher" model's knowledge is imparted to a smaller "student" model. This method allows the condensed model to retain much of the predictive power of the original model at a fraction of the computational cost.
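A common formulation of distillation trains the student to match the teacher's softened output distribution. The sketch below is an illustration under assumptions rather than a full training loop: it takes hypothetical logit vectors for a single example and computes a temperature-scaled KL divergence between teacher and student.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice this term is usually combined with the ordinary loss on ground-truth labels.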

Another method that shows promise is Sparse Transformers, a technique where each token attends to a subset of previous tokens, not all tokens, thereby significantly reducing computational and memory requirements. This architecture-focused approach is congruent with how our brains function by focusing only on relevant stimuli.
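The "subset of previous tokens" can be expressed as an attention mask. The sketch below builds one such mask, assuming a strided pattern in which each position attends to its most recent tokens plus every stride-th earlier token. The function names are hypothetical, and this is only one of several sparse attention patterns.

```python
def strided_causal_mask(seq_len, stride):
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                  # never look at future tokens
            local = i - j < stride           # the most recent `stride` tokens
            strided = (i - j) % stride == 0  # every `stride`-th earlier token
            row.append(causal and (local or strided))
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)


mask = strided_causal_mask(seq_len=16, stride=4)
dense_pairs = 16 * 17 // 2  # full causal attention for comparison
```

Comparing `attended_pairs(mask)` with `dense_pairs` shows how many query-key interactions the sparse pattern skips, and the gap widens as the sequence grows.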

Finally, quantization reduces the precision of the numbers used in a model (for example, from 32-bit floating point to 8-bit integers), which shrinks the model and speeds up computation.
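As a minimal sketch of the idea, the code below applies symmetric per-tensor quantization to signed 8-bit integers. The scheme and function names are assumptions chosen for illustration; production systems use more elaborate variants (per-channel scales, calibration data, and so on).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point values."""
    return [qi * scale for qi in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight is stored in one byte instead of four, at the price of a rounding error of at most half a quantization step per value.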

Sparsity in connection to Scaling

Why is sparsity important, and how does it relate to model scaling? Training a Large Language Model requires careful allocation of computational resources. Though a compute-optimal LLM may appear desirable, practical use cases often benefit more from smaller, bang-for-your-buck models. These smaller models are faster and more cost-effective at inference, and more manageable for developers and researchers, especially those with limited GPU resources. With smaller models, for instance, the open-source community has more opportunities to iterate, even on CPU devices. MosaicML (a Lux portfolio company acquired by Databricks) is one company leading the effort towards manageable and scalable model training.

Furthermore, it's crucial to understand the trade-off between model size and computational overhead. Existing analysis from Kaplan et al. and the Chinchilla paper indicates the existence of a "critical model size": the smallest LLM size that can reach a given loss (a measure of the model's error on its training objective, where lower is better). Shrinking the model any further yields diminishing returns and substantially increased computational costs.

For example, the critical model size is approximately 30% of the Chinchilla-optimal model size and incurs roughly a 100% computational overhead, meaning about twice the training compute to reach the same loss.

Yet recent models such as LLaMA-7B, which was trained on 1 trillion tokens, are still far from reaching the critical model size, indicating that there is ample room to train "smaller" LLMs for longer. Therefore, sparse models that target this critical model size can provide more efficiency, accessibility, and sustainability, suggesting "less" can indeed be "more."
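The 30% and 100% figures can be reproduced approximately with a short calculation. The sketch below rests on several assumptions not spelled out in this article: a loss of the form L(N, D) = E + A/N^alpha + B/D^beta, the fitted exponents alpha of about 0.34 and beta of about 0.28 reported in the Chinchilla paper, training compute proportional to N times D, and a starting point that is compute-optimal under that fit.

```python
ALPHA = 0.34  # parameter-count exponent (assumed, from the Chinchilla fit)
BETA = 0.28   # training-token exponent (assumed, from the Chinchilla fit)


def data_scale(model_scale):
    """Factor by which training tokens must grow to keep the loss unchanged
    when the model is `model_scale` times the compute-optimal size."""
    # At the compute optimum the two loss terms satisfy
    # alpha * A / N^alpha = beta * B / D^beta, which gives the ratio BETA / ALPHA.
    inner = 1.0 - (model_scale ** -ALPHA - 1.0) * BETA / ALPHA
    if inner <= 0.0:
        return float("inf")  # no amount of data recovers the loss
    return inner ** (-1.0 / BETA)


def compute_overhead(model_scale):
    """Extra training compute relative to the compute-optimal run (0.0 = none)."""
    return model_scale * data_scale(model_scale) - 1.0


for scale in (1.0, 0.5, 0.3):
    print(f"{scale:.0%} of optimal size -> {compute_overhead(scale):.0%} extra compute")
```

Under these assumptions, a model at 30% of the optimal size needs roughly double the training compute to match the loss, and the overhead climbs steeply for smaller models until the target loss becomes unreachable.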

Scaling Laws and Compute-optimal scaling [Source: DeepMind]

Different layers of Sparsity

Despite these numerous benefits, implementing sparse language models is not without challenges. The main question is what to exclude from the neural network to strike an optimal balance between computational efficiency and model performance. Classical techniques like dropout (randomly zeroing units with probability p during training) are hitting walls, and models are becoming more token-hungry at scale.

With regard to the choice of sparse parameterization, many existing methods, e.g., pruning (Lee et al. 2018, Evci et al. 2020), lottery tickets (Frankle et al. 2018), and hashing (Chen et al. 2019, Kitaev et al. 2020), maintain dynamic sparsity masks. However, the overhead of evolving the sparsity mask often slows down (instead of speeding up!) training.
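To illustrate what "evolving the sparsity mask" involves, here is a minimal sketch of a single prune-and-regrow step on a flat list of weights. It is a generic illustration, not the procedure of any of the cited papers: the random regrowth rule and the function name are assumptions made for simplicity.

```python
import random


def update_mask(weights, mask, n_swap, rng):
    """One prune-and-regrow step on flat lists; returns a new mask.

    Drops the `n_swap` active weights with the smallest magnitude and
    activates `n_swap` inactive positions chosen at random, so the
    number of nonzero weights stays constant.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    n_swap = min(n_swap, len(active), len(inactive))
    # Ranking the active weights is the kind of bookkeeping that adds
    # overhead on top of the actual training step.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:n_swap]
    grow = rng.sample(inactive, n_swap)  # assumed regrowth rule: uniform random
    new_mask = list(mask)
    for i in drop:
        new_mask[i] = False
    for i in grow:
        new_mask[i] = True
    return new_mask
```

Every such update needs a ranking pass and irregular memory writes, and it changes which weights the next forward pass must touch, which hints at why the bookkeeping can outweigh the savings.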

Moreover, on the hardware side, most existing methods adopt unstructured sparsity, which may be efficient in theory but not on hardware such as GPUs, which are highly optimized for dense computation. An unstructured sparse model with 1% nonzero weights can be as slow as a dense model (Hooker et al. 2020). Layer-agnostic sparsity is another open direction: most existing work targets a single type of operation such as attention (Child et al. 2019, Zaheer et al. 2020), whereas neural networks often compose different modules (attention, multilayer perceptron or MLP). In many applications the MLP layers are the main training bottleneck (Wu et al. 2020). Overall, minimizing dense computation and enabling compression of the model weights will require further work.
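By contrast with unstructured sparsity, structured patterns constrain where the zeros may fall so that hardware can exploit them. One example, brought in here purely as an illustration, is N:M sparsity, such as the 2:4 pattern supported by recent NVIDIA GPUs, where every group of four consecutive weights keeps at most two nonzero values. The sketch below uses a hypothetical helper that enforces this pattern by magnitude.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m consecutive weights."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries to keep within this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out
```

Because the positions of the zeros follow a fixed, regular rule, a kernel can skip them without the irregular memory access that makes unstructured sparsity slow on GPUs.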

Future Directions

The future prospects of sparsity in AI look promising. Imagine efficient models small enough to fit on mobile devices yet capable of providing advanced functionalities, bringing the power of AI to even the most resource-constrained environments and ushering in the true tinyML future. Future implementations could even pave the way towards faster training times for sparse models compared to dense models, increasing their accessibility to the wider machine-learning community.

Moreover, as the focus in industry shifts towards data-centric AI, we foresee how data can dramatically influence a model's learning trajectory. It's possible that a model could maintain its accuracy or quality by only using a subset of training data, thereby accelerating training: data-sparsity could be more crucial than ever.

Sparse matrices can help to accelerate inference [Source: Nvidia]

Asymptoting towards Sparseness

Sparsity, with its roots in biological neuroscience and its future in efficient, democratized AI, will be critical in the evolution of LLMs. Its ability to cut through the computational noise and focus on the most critical aspects of information processing positions it as a key player in the journey towards more efficient, sustainable, and accessible AI models. Breakthroughs in sparsity may result from mathematical advances in how we algorithmically support operations at the matrix or layer level. Advances may also manifest in the interactions between existing sparsity techniques and new forms of hardware and GPU/TPU devices, enabling opportunities to enhance inference.

As attention mechanisms become increasingly memory-efficient and context windows extend into the million-token scale, the benefits of sparse over dense regimes will become ever more obvious. As we approach the era of less is more, it will be worthwhile to explore, experiment with, and embrace the principles and techniques that sparsity as a paradigm has to offer.
