Pan, Y., Feng, Y., et al. (2025, September 5). SpikingBrain Technical Report: Spiking Brain-inspired Large Models. arXiv. https://arxiv.org/abs/2509.05276
Abstract
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware.
Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4 percent. The proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
Here are some thoughts:
The SpikingBrain project introduces a new family of large language models (LLMs) inspired by how the human brain works, specifically how biological neurons communicate using sparse, event-driven "spikes." The goal is to build powerful AI models that are dramatically more efficient, especially when handling very long documents or conversations, while still matching the performance of today's best open-source models.
Why does this matter? Current LLMs (like those based on the Transformer architecture) are incredibly powerful but also incredibly expensive to train and run. Their training compute grows quadratically with sequence length, and their inference memory (the key-value cache) keeps growing as the context gets longer. This makes them impractical for long-context tasks or for deployment on edge devices.
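To make that scaling concrete before getting to the fixes, here is a back-of-the-envelope comparison. The hidden size and context lengths below are assumed for illustration; none of these numbers are measurements from the paper.

```python
# Back-of-the-envelope comparison of how attention cost grows with context length.
# Illustrative only: d and the context lengths are assumptions, not paper figures.
d = 4096                                   # hidden size (assumed)
for n in (8_000, 128_000, 4_000_000):      # context lengths in tokens (assumed)
    quadratic = n * n * d                  # pairwise scores: standard softmax attention
    linear = n * d * d                     # fixed-size state updates: linear attention
    print(f"n={n:>9,}  quadratic/linear cost ratio ~ {quadratic / linear:,.0f}x")
# The ratio is simply n / d, so at 4M tokens a linear-attention layer does on the
# order of 1000x less attention work than a quadratic one.
```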
SpikingBrain tackles these problems with three big ideas:
- Brain-Inspired Architecture: Instead of standard attention (which is computationally heavy), SpikingBrain uses "linear attention" and hybrid designs whose cost scales linearly with sequence length. This means training and inference stay fast and memory-efficient, even for sequences millions of tokens long; the 7B model achieves a more than 100x speedup in time to first token on a 4-million-token input! (A minimal linear-attention sketch follows this list.)
- Efficient Training from Existing Models: Rather than training from scratch (which requires trillions of tokens), SpikingBrain "converts" existing open-source models (such as Qwen2.5) using only about 150 billion tokens of continual pre-training, roughly 2% of what training from scratch normally needs. They also use a clever "MoE upcycling" technique to expand model capacity without massive extra compute (sketched after this list).
- Spiking Neurons for Ultra-Low Power: During inference, activations are converted into "spike trains," which are sparse, integer-based signals that mimic how real neurons fire. This achieves ~69% sparsity, meaning computation can be skipped wherever no "spike" occurs, which is most of the time. On future neuromorphic (brain-like) hardware, the authors estimate this could slash energy consumption by up to 97% compared to standard chips, making it a natural fit for mobile or embedded AI. (A toy spike-coding sketch follows this list.)
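To see why linear attention keeps memory flat, here is a minimal sketch of its recurrent form in plain NumPy. The feature map and normalization are generic assumptions for illustration, not the actual SpikingBrain kernels.

```python
# Minimal sketch of linear attention in its recurrent form (illustrative only).
import numpy as np

def linear_attention(q, k, v):
    """Process a sequence token by token with a fixed-size state.

    q, k, v have shape (seq_len, d). The running state S is (d, d), so memory
    does not grow with sequence length, unlike the KV cache of softmax attention.
    """
    seq_len, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map (assumed)
    S = np.zeros((d, d))                        # fixed-size associative state
    z = np.zeros(d)                             # running normalizer
    out = np.zeros_like(v)
    for t in range(seq_len):
        S += np.outer(phi(k[t]), v[t])          # fold this token's key/value into the state
        z += phi(k[t])
        out[t] = (phi(q[t]) @ S) / (phi(q[t]) @ z)   # read the state with the query
    return out

# Per-token cost is O(d^2) regardless of how many tokens came before, which is
# what makes training compute linear in sequence length and inference memory constant.
q, k, v = np.random.randn(3, 8, 16)
print(linear_attention(q, k, v).shape)   # (8, 16)
```

The hybrid 76B model mixes linear layers with standard attention, which is why the abstract describes its inference memory as only "(partially) constant."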
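The commonly described recipe behind MoE upcycling is to initialize every expert as a copy of the dense model's feed-forward block and add a freshly initialized router, so the expanded model starts out behaving like the dense one. The sketch below follows that generic recipe; the module structure, routing details, and shapes are illustrative assumptions, not the SpikingBrain implementation.

```python
# Generic MoE-upcycling sketch: experts start as clones of a pretrained dense FFN.
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, num_experts: int, d_model: int):
        super().__init__()
        # Each expert begins as an exact copy of the dense FFN, so the upcycled
        # model initially matches the dense model's behavior closely.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)   # trained from scratch

    def forward(self, x, top_k: int = 2):
        scores = torch.softmax(self.router(x), dim=-1)   # (batch, seq, num_experts)
        weights, idx = scores.topk(top_k, dim=-1)        # route each token to its top-k experts
        out = torch.zeros_like(x)
        for k in range(top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).float()   # tokens assigned to expert e
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out

# Usage: wrap a stand-in dense FFN into an 8-expert MoE block.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe = UpcycledMoE(dense, num_experts=8, d_model=64)
print(moe(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```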
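And here is one toy way to picture spike coding and where a sparsity number comes from: clamp negative activations, quantize the rest into integer spike counts, and count how many positions emit nothing. The threshold and the Gaussian stand-in activations are assumptions for illustration; this is not the paper's actual coding scheme.

```python
# Toy sketch of spike coding: real-valued activations become integer spike counts,
# and positions below threshold emit no spikes at all (illustrative only).
import numpy as np

def to_spike_counts(activations, threshold):
    """Map real-valued activations to non-negative integer spike counts."""
    return np.floor(np.maximum(activations, 0.0) / threshold).astype(int)

rng = np.random.default_rng(0)
acts = rng.normal(size=100_000)              # stand-in for a layer's activations (assumed)
spikes = to_spike_counts(acts, threshold=1.0)

sparsity = np.mean(spikes == 0)
print(f"fraction of positions emitting no spikes: {sparsity:.2%}")
# On event-driven hardware, the zero positions cost (approximately) nothing,
# which is where the claimed energy savings come from.
```

The paper's adaptive spiking neurons report 69.15% sparsity; the toy threshold above will land elsewhere, but the skip-the-zeros logic is the same.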
The team built and tested two models:
- SpikingBrain-7B: A lean, linear model optimized for long-context speed.
- SpikingBrain-76B: A larger, hybrid model using Mixture-of-Experts (MoE) for higher performance while keeping efficiency.
Both were trained entirely on MetaX GPUs, a non-NVIDIA platform, proving that cutting-edge, brain-inspired AI can be developed outside the usual hardware ecosystems. Training stayed stable for weeks on hundreds of MetaX C550 GPUs, even at 76B parameters, with the 7B model reaching a Model FLOPs Utilization of 23.4%.
In tests, these models performed comparably to open-source Transformer baselines, despite using far less training data and compute. They also demonstrated constant (for the linear 7B model) or partially constant (for the hybrid 76B model) memory usage during inference, a game-changer for long-document processing.
In short, SpikingBrain shows that by borrowing principles from neuroscience, such as sparse activation, event-driven computation, and efficient memory mechanisms, we can build the next generation of LLMs that are not just smarter, but also faster, leaner, and far more energy-efficient. This opens the door to running powerful AI on everything from data centers to your smartphone without melting your battery.